2025-05-23-12-07

Logic-of-Thought: Empowering Large Language Models with Logic Programs for Solving Puzzles in Natural Language

Abstract

arXiv:2505.16114v1 (announce type: new)

Solving puzzles in natural language poses a long-standing challenge in AI. While large language models (LLMs) have recently shown impressive capabilities in a variety of tasks, they continue to struggle with complex puzzles that demand precise reasoning and exhaustive search. In this paper, we propose Logic-of-Thought (Logot), a novel framework that bridges LLMs with logic programming to address this problem. Our method leverages LLMs to translate puzzle rules and states into answer set programs (ASPs), whose solutions are then accurately and efficiently inferred by an ASP interpreter. This hybrid approach combines the natural language understanding of LLMs with the precise reasoning capabilities of logic programs. We evaluate our method on various grid puzzles and dynamic puzzles involving actions, demonstrating near-perfect accuracy across all tasks. Our code and data are available at: https://github.com/naiqili/Logic-of-Thought.

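Summary

Solving puzzles posed in natural language is a long-standing challenge in artificial intelligence. Although large language models (LLMs) have recently performed strongly across many tasks, they still struggle with complex puzzles that require precise reasoning and exhaustive search. This paper proposes Logic-of-Thought (Logot), a framework that addresses the problem by combining LLMs with logic programming. The method uses an LLM to translate puzzle rules and states into answer set programs (ASPs), which an ASP interpreter then solves accurately and efficiently. This hybrid approach pairs the natural language understanding of LLMs with the precise reasoning of logic programs. Evaluated on a range of grid puzzles and dynamic puzzles involving actions, the method achieves near-perfect accuracy on all tasks. Code and data are available at https://github.com/naiqili/Logic-of-Thought.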

SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution

Abstract

arXiv:2505.16048v1 (announce type: new)

We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLMs) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as the 2D boundary, applied forces, and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the evaluation of spatial and physical reasoning abilities in 2D settings, offering a complementary perspective to traditional language and logic benchmarks.

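Summary

We present a new dataset for evaluating the physical and spatial reasoning of large language models (LLMs), built on topology optimization. Given a 2D boundary, applied forces, and support conditions, a model must infer the optimal material distribution. The dataset covers several task types, from completing masked regions in partial structures to predicting full material distributions. Solving these tasks requires understanding how forces flow and where material is needed under the given constraints, without simulation tools or explicit physical models, which tests a model's reasoning about structural stability and spatial organization. The dataset focuses on spatial and physical reasoning in 2D settings and complements traditional language and logic benchmarks.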

Causal LLM Routing: End-to-End Regret Minimization from Observational Data

Abstract

arXiv:2505.16037v1 (announce type: new)

LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.

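Summary

LLM routing aims to select the most suitable model for each query, balancing competing metrics such as accuracy and cost across a pool of language models. Existing methods usually take a decoupled approach: they first predict each metric and then choose a model from those estimates. This setup is prone to error accumulation and often depends on full-feedback data, in which every query is evaluated by all candidate models and which is expensive to collect and maintain. In contrast, we learn from observational data, which records only the outcome of the model that was actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision regret on observational data. For efficient optimization we introduce two theoretically grounded surrogate objectives: a classification-based upper bound and a softmax-weighted regret approximation that is shown to recover the optimal policy at convergence. We further extend the framework with an interval-conditioned architecture to handle heterogeneous cost preferences. On public benchmarks the method outperforms existing baselines and reaches state-of-the-art performance across different embedding models.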
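To make the idea of a softmax-weighted regret concrete, here is a minimal Python sketch. It is not the paper's estimator: it assumes the utility of every candidate model is already known for a query (the full-feedback case the paper tries to avoid), and the names `softmax_weighted_regret`, `scores`, and `utilities` are hypothetical. The utility could be, for example, accuracy minus a cost penalty.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of routing scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def regret(utilities, chosen):
    """Regret of routing to `chosen`: gap to the best model's utility."""
    return max(utilities) - utilities[chosen]

def softmax_weighted_regret(scores, utilities):
    """Expected regret when the router picks model k with probability softmax(scores)[k].

    Assumption for illustration: `utilities` holds a known utility per model.
    """
    probs = softmax(scores)
    return sum(p * regret(utilities, k) for k, p in enumerate(probs))

# Hypothetical utilities for three candidate models on one query.
utilities = [0.9, 0.6, 0.3]
uniform = softmax_weighted_regret([0.0, 0.0, 0.0], utilities)
confident = softmax_weighted_regret([50.0, 0.0, 0.0], utilities)
print(uniform, confident)
```

A router that spreads probability evenly pays the average gap to the best model, while one that concentrates its scores on the best model drives this quantity toward zero, which is why such an objective can be minimized with gradient methods.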

Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development

Abstract

arXiv:2505.16086v1 (announce type: new)

We have seen remarkable progress in large language model (LLM)-empowered multi-agent systems solving complex tasks that require cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems utilizing natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompt optimization pipeline: identifying underperforming agents and their failure explanations using textual feedback, and then optimizing the system prompts of the identified agents using those failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online against offline optimization and individual against group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimizations. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on group behaviors of the multi-agent systems to provide practical insights for future development.

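Summary

Multi-agent systems built on large language models (LLMs) have made notable progress on complex tasks that require cooperation among experts from different fields. Optimizing such systems nevertheless remains difficult. Through an empirical case study, this work examines group optimization of role-based multi-agent systems using natural language feedback on software development tasks, analyzed along several evaluation dimensions. We propose a two-stage agent prompt optimization pipeline: textual feedback first identifies underperforming agents and the reasons for their failures, and the system prompts of those agents are then optimized using the failure explanations. Using two comparisons, online versus offline optimization and individual versus group optimization, we study how different optimization settings affect system performance. For group optimization, we compare one-pass and multi-pass prompting strategies. The results show that the method improves role-based multi-agent systems on software development tasks across the evaluation dimensions. We also examine how the optimization settings affect the group behavior of multi-agent systems, offering practical insights for future research.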

LLM-Powered AI Agent Systems and Their Applications in Industry

Abstract

arXiv:2505.16120v1 (announce type: new)

The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-powered agents offer greater flexibility, cross-domain reasoning, and natural language interaction. Moreover, with the integration of multi-modal LLMs, current agent systems are highly capable of processing diverse data modalities, including text, images, audio, and structured tabular data, enabling richer and more adaptive real-world behavior. This paper comprehensively examines the evolution of agent systems from the pre-LLM era to current LLM-powered architectures. We categorize agent systems into software-based, physical, and adaptive hybrid systems, highlighting applications across customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the primary challenges posed by LLM-powered agents, including high inference latency, output uncertainty, lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate these concerns.

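Summary

The emergence of large language models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with a limited task scope, LLM-based agents offer greater flexibility, cross-domain reasoning, and natural language interaction. With the integration of multimodal LLMs, current agent systems can also process many data modalities, including text, images, audio, and structured tabular data, which enables richer and more adaptive real-world behavior. This paper surveys the evolution of agent systems from the pre-LLM era to current LLM-based architectures. It groups agent systems into software-based, physical, and adaptive hybrid types, and highlights applications in customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the main challenges facing LLM-based agents, including high inference latency, output uncertainty, a lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate them.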

TrialPanorama: Database and Benchmark for Systematic Review and Design of Clinical Trials

Abstract

arXiv:2505.16097v1 (announce type: new)

Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for both training and evaluation. In this work, we introduce TrialPanorama, a large-scale, structured database comprising 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key aspects of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This structured and ontology-grounded design enables TrialPanorama to serve as a unified, extensible resource for a wide range of clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database. The benchmark spans eight tasks across two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). Experiments using five state-of-the-art large language models (LLMs) show that while general-purpose LLMs exhibit some zero-shot capability, their performance is still inadequate for high-stakes clinical trial workflows. We release the TrialPanorama database and the benchmark to facilitate further research on AI for clinical trials.

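Summary

Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for training and evaluation. This work introduces TrialPanorama, a large-scale structured database of 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key elements of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This ontology-grounded, structured design lets TrialPanorama serve as a unified and extensible resource for many clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the database, covering eight tasks in two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). Experiments with five state-of-the-art LLMs show that, although general-purpose LLMs have some zero-shot capability, their performance still falls short of what high-stakes clinical trial workflows require. We release the TrialPanorama database and benchmark to support further research on AI for clinical trials.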

Sudoku-Bench: Evaluating creative reasoning with Sudoku variants

Abstract

arXiv:2505.16135v1 (announce type: new)

Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles, making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.

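Summary

Existing reasoning benchmarks for large language models (LLMs) often fail to capture genuine creativity and tend to reward memorization of known patterns. To address this, we present Sudoku-Bench, a curated benchmark of Sudoku variants designed to evaluate creative, multi-step logical reasoning. Sudoku variants are an unusually effective domain for reasoning research: each puzzle has unique or subtly interacting constraints, which defeats memorization and requires the solver to find a novel logical breakthrough (a "break-in"). Despite their diversity, Sudoku variants share a common, compact structure that allows clear and consistent evaluation. Sudoku-Bench includes a carefully selected puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles, so it can be extended into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of the puzzles unaided, leaving substantial room for advancing long-horizon, strategic reasoning.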

How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior

Abstract

arXiv:2505.16067v1 (announce type: new)

Memory is a critical component in large language model (LLM)-based agents, enabling them to store and retrieve past executions to improve task performance over time. In this paper, we conduct an empirical study on how memory management choices impact the LLM agents' behavior, especially their long-term performance. Specifically, we focus on two fundamental memory operations that are widely used by many agent frameworks: addition, which incorporates new experiences into the memory base, and deletion, which selectively removes past experiences. We systematically study their impact on agent behavior. Through our quantitative analysis, we find that LLM agents display an experience-following property: high similarity between a task input and the input in a retrieved memory record often results in highly similar agent outputs. Our analysis further reveals two significant challenges associated with this property: error propagation, where inaccuracies in past experiences compound and degrade future performance, and misaligned experience replay, where outdated or irrelevant experiences negatively influence current tasks. Through controlled experiments, we show that combining selective addition and deletion strategies can help mitigate these negative effects, yielding an average absolute performance gain of 10% compared to naive memory growth. Furthermore, we highlight how memory management choices affect agents' behavior under challenging conditions such as task distribution shifts and constrained memory resources. Our findings offer insights into the behavioral dynamics of LLM agent memory systems and provide practical guidance for designing memory components that support robust, long-term agent performance. We also release our code to facilitate further study.

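Summary

Memory is a key component of large language model (LLM)-based agents, letting them store and retrieve past executions to improve task performance over time. This paper presents an empirical study of how memory management choices affect LLM agent behavior, especially long-term performance. We focus on two basic memory operations that most agent frameworks use, addition (incorporating new experiences into the memory base) and deletion (selectively removing past experiences), and systematically analyze their effect on agent behavior. Quantitative analysis shows that LLM agents exhibit an experience-following property: when a task input is highly similar to the input of a retrieved memory record, the agent's output tends to be highly similar as well. The analysis further reveals two challenges caused by this property: error propagation, where errors in past experiences accumulate and degrade future performance, and misaligned experience replay, where outdated or irrelevant experiences harm current tasks. In controlled experiments, combining selective addition and deletion strategies mitigates these effects and yields an average absolute performance gain of 10% over naive memory growth. We also show how memory management choices affect agent behavior under challenging conditions such as task distribution shift and constrained memory resources. The findings characterize the behavioral dynamics of LLM agent memory systems and offer practical guidance for designing memory components that support robust long-term performance. We also release our code to support follow-up research.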
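The two memory operations described above, selective addition and selective deletion, can be sketched as a small Python class. This is an illustrative toy, not the paper's implementation: the `quality` score, the `min_quality` threshold, the capacity-based deletion rule, and the Jaccard word-overlap similarity are all assumptions chosen to keep the example self-contained.

```python
from dataclasses import dataclass

@dataclass
class Record:
    task_input: str
    output: str
    quality: float  # hypothetical evaluator score in [0, 1]

def jaccard(a, b):
    """Word-overlap similarity; a stand-in for an embedding-based retriever."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

class MemoryBank:
    def __init__(self, min_quality=0.5, capacity=3):
        self.min_quality = min_quality
        self.capacity = capacity
        self.records = []

    def add(self, record):
        """Selective addition: store only experiences judged good enough."""
        if record.quality < self.min_quality:
            return False
        self.records.append(record)
        self._delete_if_full()
        return True

    def _delete_if_full(self):
        """Selective deletion: drop the lowest-quality record while over capacity."""
        while len(self.records) > self.capacity:
            worst = min(self.records, key=lambda r: r.quality)
            self.records.remove(worst)

    def retrieve(self, task_input):
        """Return the stored record whose input is most similar to the new task."""
        if not self.records:
            return None
        return max(self.records, key=lambda r: jaccard(task_input, r.task_input))

bank = MemoryBank(min_quality=0.5, capacity=2)
bank.add(Record("sort a list of numbers", "use sorted()", 0.9))
bank.add(Record("sort a list", "hand-written bubble sort", 0.2))  # rejected
bank.add(Record("parse a json file", "use json.load", 0.7))
print(bank.retrieve("sort numbers in a list").output)
```

Because the agent tends to imitate whatever record is retrieved, rejecting the low-quality record at insertion time is what keeps a poor solution from being replayed on later, similar tasks.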

Can AI Read Between The Lines? Benchmarking LLMs On Financial Nuance

Abstract

arXiv:2505.16090v1 (announce type: new)

As of 2025, Generative Artificial Intelligence (GenAI) has become a central tool for productivity across industries. Beyond text generation, GenAI now plays a critical role in coding, data analysis, and research workflows. As large language models (LLMs) continue to evolve, it is essential to assess the reliability and accuracy of their outputs, especially in specialized, high-stakes domains like finance. Most modern LLMs transform text into numerical vectors, which are used in operations such as cosine similarity searches to generate responses. However, this abstraction process can lead to misinterpretation of emotional tone, particularly in nuanced financial contexts. While LLMs generally excel at identifying sentiment in everyday language, these models often struggle with the nuanced, strategically ambiguous language found in earnings call transcripts. Financial disclosures frequently embed sentiment in hedged statements, forward-looking language, and industry-specific jargon, making it difficult even for human analysts to interpret consistently, let alone AI models. This paper presents findings from the Santa Clara Microsoft Practicum Project, led by Professor Charlie Goldenberg, which benchmarks the performance of Microsoft's Copilot, OpenAI's ChatGPT, Google's Gemini, and traditional machine learning models for sentiment analysis of financial text. Using Microsoft earnings call transcripts, the analysis assesses how well LLM-derived sentiment correlates with market sentiment and stock movements and evaluates the accuracy of model outputs. Prompt engineering techniques are also examined to improve sentiment analysis results. Visualizations of sentiment consistency are developed to evaluate alignment between tone and stock performance, with sentiment trends analyzed across Microsoft's lines of business to determine which segments exert the greatest influence.

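Summary

As of 2025, generative artificial intelligence (GenAI) has become a core productivity tool across industries. Beyond text generation, GenAI now plays a key role in coding, data analysis, and research workflows. As large language models (LLMs) continue to evolve, assessing the reliability and accuracy of their outputs becomes essential, especially in specialized, high-stakes domains such as finance. Most modern LLMs convert text into numerical vectors and generate responses through operations such as cosine similarity search. This abstraction can lead to misreading of emotional tone, particularly in nuanced financial contexts. LLMs identify sentiment well in everyday language, but they often struggle with the strategically ambiguous language of earnings call transcripts. Financial disclosures frequently embed sentiment in hedged statements, forward-looking language, and industry-specific jargon, which even human analysts find hard to interpret consistently, and AI models find harder still. This paper presents results from the Santa Clara Microsoft Practicum Project, led by Professor Charlie Goldenberg, which benchmarks Microsoft's Copilot, OpenAI's ChatGPT, Google's Gemini, and traditional machine learning models on sentiment analysis of financial text. Using Microsoft earnings call transcripts, the study assesses how well LLM-derived sentiment correlates with market sentiment and stock movements, and checks the accuracy of model outputs. It also examines prompt engineering techniques for improving sentiment analysis, develops visualizations of sentiment consistency to evaluate the alignment between tone and stock performance, and analyzes sentiment trends across Microsoft's lines of business to determine which segments have the greatest influence.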

MAPS: A Multilingual Benchmark for Global Agent Performance and Security

Abstract

arXiv:2505.15935v1 (announce type: new)

Agentic AI systems, which build on Large Language Models (LLMs) and interact with tools and memory, have rapidly advanced in capability and scope. Yet, since LLMs have been shown to struggle in multilingual settings, typically resulting in lower performance and reduced safety, agentic systems risk inheriting these limitations. This raises concerns about the global accessibility of such systems, as users interacting in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI, existing benchmarks focus exclusively on English, leaving multilingual settings unexplored. To address this gap, we propose MAPS, a multilingual benchmark suite designed to evaluate agentic AI systems across diverse languages and tasks. MAPS builds on four widely used agentic benchmarks: GAIA (real-world tasks), SWE-bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). We translate each dataset into ten diverse languages, resulting in 805 unique tasks and 8,855 total language-specific instances. Our benchmark suite enables a systematic analysis of how multilingual contexts affect agent performance and robustness. Empirically, we observe consistent degradation in both performance and security when transitioning from English to other languages, with severity varying by task and correlating with the amount of translated input. Building on these findings, we provide actionable recommendations to guide agentic AI systems development and assessment under multilingual settings. This work establishes a standardized evaluation framework, encouraging future research towards equitable, reliable, and globally accessible agentic AI. The MAPS benchmark suite is publicly available at https://huggingface.co/datasets/Fujitsu-FRE/MAPS.

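Summary

Agentic AI systems, which build on large language models (LLMs) and interact with tools and memory, are advancing rapidly in capability and scope. Because LLMs have been shown to perform worse and less safely in multilingual settings, agentic systems may inherit these weaknesses. This raises concerns about the global accessibility of such systems: users working in languages other than English may encounter unreliable or security-critical agent behavior. Despite growing interest in evaluating agentic AI, existing benchmarks focus only on English and leave multilingual settings unexplored. To fill this gap, we propose MAPS, a benchmark suite for evaluating agentic AI systems across languages and tasks. MAPS builds on four widely used agentic benchmarks: GAIA (real-world tasks), SWE-bench (code generation), MATH (mathematical reasoning), and the Agent Security Benchmark (security). Each dataset is translated into ten languages, giving 805 unique tasks and 8,855 language-specific instances. The suite supports systematic analysis of how multilingual contexts affect agent performance and robustness. Empirically, both performance and security degrade consistently when moving from English to other languages, with severity that varies by task and correlates with the amount of translated input. Based on these findings, we give actionable recommendations for developing and assessing agentic AI systems in multilingual settings. This work establishes a standardized evaluation framework and encourages research toward equitable, reliable, and globally accessible agentic AI. The MAPS benchmark suite is publicly available at https://huggingface.co/datasets/Fujitsu-FRE/MAPS.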
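The two counts quoted above can be reconciled with a line of arithmetic. The snippet below is only a consistency check under one assumption that the abstract does not state explicitly: that the English originals are counted alongside the ten translated languages, giving eleven language versions per task.

```python
unique_tasks = 805
translated_languages = 10

# Assumption: the English original is counted as an eleventh language version.
language_versions = translated_languages + 1

instances = unique_tasks * language_versions
print(instances)  # 8855, matching the 8,855 instances reported in the abstract
```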

LightRouter: Towards Efficient LLM Collaboration with Minimal Overhead

Abstract

arXiv:2505.16221v1 (announce type: new)

The rapid advancement of large language models has unlocked remarkable capabilities across a diverse array of natural language processing tasks. However, the considerable differences among available LLMs in terms of cost, performance, and computational demands pose significant challenges for users aiming to identify the most suitable model for specific tasks. In this work, we present LightRouter, a novel framework designed to systematically select and integrate a small subset of LLMs from a larger pool, with the objective of jointly optimizing both task performance and cost efficiency. LightRouter leverages an adaptive selection mechanism to identify models that require only a minimal number of boot tokens, thereby reducing costs, and further employs an effective integration strategy to combine their outputs. Extensive experiments across multiple benchmarks demonstrate that LightRouter matches or outperforms widely-used ensemble baselines, achieving up to a 25% improvement in accuracy. Compared with leading high-performing models, LightRouter achieves comparable performance while reducing inference costs by up to 27%. Importantly, our framework operates without any prior knowledge of individual models and relies exclusively on inexpensive, lightweight models. This work introduces a practical approach for efficient LLM selection and provides valuable insights into optimal strategies for model combination.

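Summary

Rapid progress in large language models has produced strong capabilities across natural language processing tasks. However, the large differences among available models in cost, performance, and computational demands make it hard for users to choose the best model for a given task. This work proposes LightRouter, a framework that selects and integrates a small number of models from a large pool to jointly optimize task performance and cost efficiency. LightRouter uses an adaptive selection mechanism to identify models that need only a minimal number of boot tokens, which reduces cost, and combines their outputs with an effective integration strategy. Extensive experiments on multiple benchmarks show that LightRouter matches or outperforms widely used ensemble baselines, with accuracy improvements of up to 25%. Compared with leading high-performing models, it achieves comparable performance while reducing inference cost by up to 27%. Notably, the framework requires no prior knowledge of individual models and relies only on inexpensive, lightweight models. The work offers a practical approach to efficient LLM selection and insight into strategies for model combination.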

MAPLE: Many-Shot Adaptive Pseudo-Labeling for In-Context Learning

Abstract

arXiv:2505.16225v1 (announce type: new)

In-Context Learning (ICL) empowers Large Language Models (LLMs) to tackle diverse tasks by incorporating multiple input-output examples, known as demonstrations, into the input of LLMs. More recently, advancements in the expanded context windows of LLMs have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, this approach is often hindered by the high cost of obtaining large amounts of labeled data. To address this challenge, we propose Many-Shot Adaptive Pseudo-LabEling, namely MAPLE, a novel influence-based many-shot ICL framework that utilizes pseudo-labeled samples to compensate for the lack of label information. We first identify a subset of impactful unlabeled samples and perform pseudo-labeling on them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input to improve the performance of many-shot ICL, without significant labeling costs. Extensive experiments on real-world datasets demonstrate the effectiveness of our framework, showcasing its ability to enhance LLM adaptability and performance with limited labeled data.

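Summary

In-context learning (ICL) enables large language models (LLMs) to handle diverse tasks by including multiple input-output examples, called demonstrations, in the model input. Recently, expanded context windows have led to many-shot ICL, which uses hundreds of demonstrations and outperforms few-shot ICL, which relies on fewer examples. However, the approach is often limited by the high cost of obtaining large amounts of labeled data. To address this, we propose MAPLE (Many-Shot Adaptive Pseudo-LabEling), an influence-based many-shot ICL framework that uses pseudo-labeled samples to compensate for missing label information. The framework first identifies a subset of influential unlabeled samples and pseudo-labels them by querying LLMs. These pseudo-labeled samples are then adaptively selected and tailored to each test query as input, improving many-shot ICL performance without significant labeling cost. Extensive experiments on real-world datasets confirm the framework's effectiveness and its ability to improve LLM adaptability and performance with limited labeled data.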

SafeKey: Amplifying Aha-Moment Insights for Safety Reasoning

Abstract

arXiv:2505.16186v1 (announce type: new)

Large Reasoning Models (LRMs) introduce a new generation paradigm of explicitly reasoning before answering, leading to remarkable improvements in complex tasks. However, they pose great safety risks against harmful queries and adversarial attacks. While supervised fine-tuning (SFT), the recent mainstream safety effort on LRMs, improves safety performance, we find that SFT-aligned models struggle to generalize to unseen jailbreak prompts. After thorough investigation of LRMs' generation, we identify a safety aha moment that can activate safety reasoning and lead to a safe response. This aha moment typically appears in the "key sentence", which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these insights, we propose SafeKey, including two complementary objectives to better activate the safety aha moment in the key sentence: (1) a Dual-Path Safety Head to enhance the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective to improve the model's attention to its query understanding, which has important safety hints. Experiments across multiple safety benchmarks demonstrate that our methods significantly improve safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6%, while maintaining general abilities. Our analysis reveals how SafeKey enhances safety by reshaping internal attention and improving the quality of hidden representations.

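Summary

Large reasoning models (LRMs) introduce a new paradigm of reasoning explicitly before answering, which brings notable gains on complex tasks. However, these models pose serious safety risks when faced with harmful queries and adversarial attacks. Although supervised fine-tuning (SFT), the current mainstream safety alignment approach, improves safety, we find that SFT-aligned models generalize poorly to unseen jailbreak prompts. After studying LRM generation in depth, we identify a "safety aha moment" that can activate safety reasoning and lead to a safe response. This moment usually appears in the "key sentence", which follows the model's query understanding process and can indicate whether the model will proceed safely. Based on these findings, we propose SafeKey, which includes two complementary objectives for better activating the safety aha moment in the key sentence: (1) a Dual-Path Safety Head that strengthens the safety signal in the model's internal representations before the key sentence, and (2) a Query-Mask Modeling objective that increases the model's attention to its query understanding, which carries important safety hints. Across multiple safety benchmarks, the method significantly improves safety generalization to a wide range of jailbreak attacks and out-of-distribution harmful prompts, lowering the average harmfulness rate by 9.6% while preserving general abilities. Analysis shows that SafeKey improves safety by reshaping internal attention and improving the quality of hidden representations.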

Losing is for Cherishing: Data Valuation Based on Machine Unlearning and Shapley Value

Abstract

arXiv:2505.16147v1 (announce type: new)

The proliferation of large models has intensified the need for efficient data valuation methods to quantify the contribution of individual data providers. Traditional approaches, such as game-theory-based Shapley value and influence-function-based techniques, face prohibitive computational costs or require access to full data and model training details, so they can hardly support partial data valuation. To address this, we propose Unlearning Shapley, a novel framework that leverages machine unlearning to estimate data values efficiently. By unlearning target data from a pretrained model and measuring performance shifts on a reachable test set, our method computes Shapley values via Monte Carlo sampling, avoiding retraining and eliminating dependence on full data. Crucially, Unlearning Shapley supports both full and partial data valuation, making it scalable for large models (e.g., LLMs) and practical for data markets. Experiments on benchmark datasets and large-scale text corpora demonstrate that our approach matches the accuracy of state-of-the-art methods while reducing computational overhead by orders of magnitude. Further analysis confirms a strong correlation between estimated values and the true impact of data subsets, validating its reliability in real-world scenarios. This work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.

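Summary

The proliferation of large models has made efficient data valuation methods, which quantify the contribution of individual data providers, increasingly necessary. Traditional approaches, such as the game-theoretic Shapley value and influence-function techniques, either incur prohibitive computational cost or require access to the full data and model training details, so they can hardly support partial data valuation. We therefore propose Unlearning Shapley, a framework that uses machine unlearning to estimate data values efficiently. The method unlearns target data from a pretrained model, measures the resulting performance change on a reachable test set, and computes Shapley values by Monte Carlo sampling, which avoids retraining and removes the dependence on full data. Importantly, Unlearning Shapley supports both full and partial data valuation, which makes it scalable to large models (e.g., LLMs) and practical for data markets. Experiments on benchmark datasets and large-scale text corpora show that the approach matches the accuracy of state-of-the-art methods while reducing computational overhead by orders of magnitude. Further analysis confirms a strong correlation between the estimated values and the true impact of data subsets, supporting its reliability in real-world scenarios. The work bridges the gap between data valuation theory and practical deployment, offering a scalable, privacy-compliant solution for modern AI ecosystems.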
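The Monte Carlo step mentioned above can be illustrated with the standard permutation-sampling estimator of Shapley values. The sketch below is generic and does not perform any unlearning: the `toy_utility` function is a made-up stand-in for "model performance given this subset of data providers", whereas the paper obtains such measurements by unlearning data from a pretrained model. All names here are hypothetical.

```python
import random

def monte_carlo_shapley(players, utility, num_permutations=2000, seed=0):
    """Estimate Shapley values by averaging marginal gains over random orderings."""
    rng = random.Random(seed)
    totals = {p: 0.0 for p in players}
    for _ in range(num_permutations):
        order = list(players)
        rng.shuffle(order)
        coalition = set()
        previous = utility(frozenset(coalition))
        for p in order:
            coalition.add(p)
            current = utility(frozenset(coalition))
            totals[p] += current - previous  # marginal contribution of p
            previous = current
    return {p: total / num_permutations for p, total in totals.items()}

# Toy utility (an assumption for illustration): each provider adds a fixed
# amount, and providers A and B together earn a 0.1 synergy bonus.
WEIGHTS = {"A": 0.5, "B": 0.3, "C": 0.2}

def toy_utility(coalition):
    bonus = 0.1 if {"A", "B"} <= coalition else 0.0
    return sum(WEIGHTS[p] for p in coalition) + bonus

values = monte_carlo_shapley(["A", "B", "C"], toy_utility)
print(values)
```

In this toy game the exact Shapley values are 0.55, 0.35, and 0.20, since A and B split the synergy bonus evenly; the estimates should land close to those numbers, and they always sum to the utility of the full coalition because every sampled ordering distributes exactly that total.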

Dynamic Sampling that Adapts: Iterative DPO for Self-Aware Mathematical Reasoning

Abstract

arXiv:2505.16176v1 (announce type: new)

In the realm of data selection for reasoning tasks, existing approaches predominantly rely on externally predefined static metrics such as difficulty and diversity, which are often designed for supervised fine-tuning (SFT) and lack adaptability to continuous training processes. A critical limitation of these methods is their inability to dynamically align with the evolving capabilities of models during online training, a gap that becomes increasingly pronounced with the rise of dynamic training paradigms and online reinforcement learning (RL) frameworks (e.g., R1 models). To address this, we introduce SAI-DPO, an algorithm that dynamically selects training data by continuously assessing a model's stage-specific reasoning abilities across different training phases. By integrating real-time model performance feedback, SAI-DPO adaptively adjusts data selection to the evolving strengths and weaknesses of the model, thus enhancing both data utilization efficiency and final task performance. Extensive experiments on three state-of-the-art models and eight mathematical reasoning benchmarks, including challenging competition-level datasets (e.g., AIME24 and AMC23), demonstrate that SAI-DPO achieves an average performance boost of up to 21.3 percentage points, with particularly notable improvements of 10 and 15 points on AIME24 and AMC23, respectively. These results highlight the superiority of dynamic, model-adaptive data selection over static, externally defined strategies in advancing reasoning.

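Summary

In data selection for reasoning tasks, existing methods rely mainly on externally predefined static metrics such as difficulty and diversity. These metrics are usually designed for supervised fine-tuning (SFT) and adapt poorly to continuous training. Their key limitation is that they cannot keep pace with a model's evolving capabilities during online training, a gap that grows more pronounced with the rise of dynamic training paradigms and online reinforcement learning (RL) frameworks (e.g., R1 models). We therefore propose SAI-DPO, an algorithm that selects training data dynamically by continuously assessing the model's stage-specific reasoning abilities across training phases. By incorporating real-time performance feedback, SAI-DPO adapts data selection to the model's changing strengths and weaknesses, improving both data utilization efficiency and final task performance. Extensive experiments on three state-of-the-art models and eight mathematical reasoning benchmarks, including competition-level datasets such as AIME24 and AMC23, show that SAI-DPO achieves an average performance gain of up to 21.3 percentage points, with notable improvements of 10 and 15 points on AIME24 and AMC23, respectively. These results show the advantage of dynamic, model-adaptive data selection over static, externally defined strategies for advancing reasoning.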

No Black Boxes: Interpretable and Interactable Predictive Healthcare with Knowledge-Enhanced Agentic Causal Discovery

Abstract

arXiv:2505.16288v1 (announce type: new)

Deep learning models trained on extensive Electronic Health Records (EHR) data have achieved high accuracy in diagnosis prediction, offering the potential to assist clinicians in decision-making and treatment planning. However, these models lack two crucial features that clinicians highly value: interpretability and interactivity. The "black-box" nature of these models makes it difficult for clinicians to understand the reasoning behind predictions, limiting their ability to make informed decisions. Additionally, the absence of interactive mechanisms prevents clinicians from incorporating their own knowledge and experience into the decision-making process. To address these limitations, we propose II-KEA, a knowledge-enhanced agent-driven causal discovery framework that integrates personalized knowledge databases and agentic LLMs. II-KEA enhances interpretability through explicit reasoning and causal analysis, while also improving interactivity by allowing clinicians to inject their knowledge and experience through customized knowledge bases and prompts. II-KEA is evaluated on both MIMIC-III and MIMIC-IV, demonstrating superior performance along with enhanced interpretability and interactivity, as evidenced by its strong results from extensive case studies.

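Summary

Deep learning models trained on large-scale electronic health record (EHR) data have achieved high accuracy in diagnosis prediction and could assist clinicians with decision-making and treatment planning. However, these models lack two features that clinicians value highly: interpretability and interactivity. Their "black-box" nature makes it hard for clinicians to understand the reasoning behind predictions, limiting their ability to make informed decisions. The absence of interactive mechanisms also prevents clinicians from bringing their own knowledge and experience into the decision process. To address these limitations, we propose II-KEA, a knowledge-enhanced, agent-driven causal discovery framework that integrates personalized knowledge bases with agentic LLMs. II-KEA improves interpretability through explicit reasoning and causal analysis, and improves interactivity by letting clinicians inject their knowledge and experience through customized knowledge bases and prompts. Evaluation on MIMIC-III and MIMIC-IV shows superior performance, and extensive case studies support its improved interpretability and interactivity.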

EquivPruner: Boosting Efficiency and Quality in LLM-Based Search via Action Pruning

Abstract

arXiv:2505.16312v1 (announce type: new)

Large Language Models (LLMs) excel at complex reasoning through search algorithms, yet current strategies often suffer from massive token consumption due to redundant exploration of semantically equivalent steps. Existing semantic similarity methods struggle to accurately identify such equivalence in domain-specific contexts like mathematical reasoning. To address this, we propose EquivPruner, a simple yet effective approach that identifies and prunes semantically equivalent actions during LLM reasoning search. We also introduce MathEquiv, the first dataset we created for mathematical statement equivalence, which enables the training of a lightweight equivalence detector. Extensive experiments across various models and tasks demonstrate that EquivPruner significantly reduces token consumption, improving searching efficiency and often bolstering reasoning accuracy. For instance, when applied to Qwen2.5-Math-7B-Instruct on GSM8K, EquivPruner reduced token consumption by 48.1% while also improving accuracy. Our code is available at https://github.com/Lolo1222/EquivPruner.

摘要

大语言模型（LLMs）通过搜索算法擅长复杂推理，但现有策略常因对语义等价步骤的冗余探索而导致大量标记消耗。现有语义相似性方法难以在数学推理等特定领域情境中准确识别此类等价性。为此，我们提出EquivPruner——一种简单而有效的方法，可在LLM推理搜索过程中识别并剪枝语义等价动作。我们还创建了首个数学陈述等价数据集MathEquiv，用于训练轻量级等价检测器。跨多种模型与任务的广泛实验表明，EquivPruner能显著降低标记消耗，提升搜索效率并常增强推理准确率。例如，在GSM8K数据集上应用Qwen2.5-Math-7B-Instruct模型时，EquivPruner将标记消耗降低48.1%，同时提高准确率。代码详见https://github.com/Lolo1222/EquivPruner。
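
The pruning idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `prune_equivalent` and `toy_equivalent` are hypothetical names, and the string-normalising comparison stands in for the lightweight learned equivalence detector the abstract mentions; it is not the EquivPruner implementation.

```python
from typing import Callable, List


def prune_equivalent(candidates: List[str],
                     is_equivalent: Callable[[str, str], bool]) -> List[str]:
    """Keep the first representative of each equivalence group, in order."""
    kept: List[str] = []
    for cand in candidates:
        # Drop the candidate if it matches any step already kept.
        if not any(is_equivalent(cand, k) for k in kept):
            kept.append(cand)
    return kept


def toy_equivalent(a: str, b: str) -> bool:
    """Hypothetical stand-in for a learned detector: compares strings after
    lower-casing and removing whitespace. It does not judge meaning."""
    def norm(s: str) -> str:
        return "".join(s.lower().split())
    return norm(a) == norm(b)


steps = ["x = 2 + 3", "X=2+3", "x = 3 + 2", "x = 2 + 3 "]
pruned = prune_equivalent(steps, toy_equivalent)
print(pruned)
```

The toy comparison keeps "x = 3 + 2" as a separate step even though it is mathematically the same as "x = 2 + 3", which hints at why a surface-level similarity check is not enough for mathematical statements.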

How do Scaling Laws Apply to Knowledge Graph Engineering Tasks? The Impact of Model Size on Large Language Model Performance

Abstract

arXiv:2505.16276v1 Announce Type: new Abstract: When using Large Language Models (LLMs) to support Knowledge Graph Engineering (KGE), one of the first indications when searching for an appropriate model is its size. According to the scaling laws, larger models typically show higher capabilities. However, in practice, resource costs are also an important factor and thus it makes sense to consider the ratio between model performance and costs. The LLM-KG-Bench framework enables the comparison of LLMs in the context of KGE tasks and assesses their capabilities of understanding and producing KGs and KG queries. Based on a dataset created in an LLM-KG-Bench run covering 26 open state-of-the-art LLMs, we explore the model size scaling laws specific to KGE tasks. In our analyses, we assess how benchmark scores evolve between different model size categories. Additionally, we inspect how the general score development of single models and families of models correlates to their size. Our analyses revealed that, with a few exceptions, the model size scaling laws generally also apply to the selected KGE tasks. However, in some cases, plateau or ceiling effects occurred, i.e., the task performance did not change much between a model and the next larger model. In these cases, smaller models could be considered to achieve high cost-effectiveness. Regarding models of the same family, sometimes larger models performed worse than smaller models of the same family. These effects occurred only locally. Hence it is advisable to additionally test the next smallest and largest model of the same family.

摘要

当使用大语言模型（LLMs）支持知识图谱工程（KGE）时，搜索合适模型的第一个指标通常是其规模。根据缩放定律，较大模型通常表现出更高能力。然而在实际应用中，资源成本也是重要考量因素，因此需要权衡模型性能与成本之间的比率。LLM-KG-Bench框架能够比较LLMs在KGE任务中的表现，评估其理解和生成知识图谱及图谱查询的能力。基于LLM-KG-Bench运行中创建的涵盖26个开源最先进LLMs的数据集，我们探索了特定于KGE任务的模型规模缩放定律。在分析中，我们评估了不同规模类别模型间基准分数的演变情况，并检验了单个模型及同系列模型的总体得分发展与其规模的相关性。分析表明，除少数例外情况外，模型规模缩放定律通常也适用于所选KGE任务。但在某些情况下会出现平台或天花板效应，即模型与更大模型之间的任务性能变化不大。此类情况下，可考虑采用较小模型以实现高成本效益。对于同系列模型，有时较大模型表现反而逊于较小版本，这些效应仅局部出现。因此建议额外测试同系列中相邻更小和更大的模型。
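
The plateau and "larger is worse" effects described above can be checked mechanically once scores per model size are available. The following is a minimal sketch, assuming results arrive as (parameter count in billions, score) pairs; the function name `find_plateaus`, the `min_gain` threshold, and the numbers are placeholders for illustration, not values from LLM-KG-Bench.

```python
def find_plateaus(results, min_gain=0.02):
    """Return (smaller, larger) size pairs whose score gain is below min_gain.

    results: iterable of (params_in_billions, score) for one model family.
    Pairs with a negative gain (larger model scores worse) are flagged too.
    """
    ordered = sorted(results)  # sort by model size
    flagged = []
    for (small, s_small), (large, s_large) in zip(ordered, ordered[1:]):
        if s_large - s_small < min_gain:
            flagged.append((small, large))
    return flagged


# Placeholder scores, invented purely to exercise the function.
family = [(7, 0.60), (1, 0.40), (14, 0.61), (32, 0.58)]
print(find_plateaus(family))
```

A flagged pair suggests the smaller model may be the more cost-effective choice, which is the situation in which the abstract advises testing the neighbouring sizes of the same family.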

Incentivizing Dual Process Thinking for Efficient Large Language Model Reasoning

Abstract

arXiv:2505.16315v1 Announce Type: new Abstract: Large reasoning models (LRMs) have demonstrated strong performance on complex reasoning tasks, but often suffer from overthinking, generating redundant content regardless of task difficulty. Inspired by the dual process theory in cognitive science, we propose Adaptive Cognition Policy Optimization (ACPO), a reinforcement learning framework that enables LRMs to achieve efficient reasoning through adaptive cognitive allocation and dynamic system switch. ACPO incorporates two key components: (1) introducing system-aware reasoning tokens to explicitly represent the thinking modes thereby making the model's cognitive process transparent, and (2) integrating online difficulty estimation and token length budget to guide adaptive system switch and reasoning during reinforcement learning. To this end, we propose a two-stage training strategy. The first stage begins with supervised fine-tuning to cold start the model, enabling it to generate reasoning paths with explicit thinking modes. In the second stage, we apply ACPO to further enhance adaptive system switch for difficulty-aware reasoning. Experimental results demonstrate that ACPO effectively reduces redundant reasoning while adaptively adjusting cognitive allocation based on task complexity, achieving efficient hybrid reasoning.

摘要

大规模推理模型（LRMs）在复杂推理任务中展现出强大性能，但普遍存在过度思考现象，即无论任务难度如何都会生成冗余内容。受认知科学中双过程理论启发，我们提出自适应认知策略优化（ACPO）——一种通过自适应认知分配与动态系统切换实现高效推理的强化学习框架。ACPO包含两个核心组件：（1）引入系统感知推理标记来显式表征思维模式，从而使模型的认知过程透明化；（2）集成在线难度评估与标记长度预算机制，以指导强化学习过程中的自适应系统切换与推理。为此，我们设计了两阶段训练策略：第一阶段通过监督微调冷启动模型，使其生成具有显式思维模式的推理路径；第二阶段应用ACPO进一步强化面向难度感知推理的自适应系统切换能力。实验结果表明，ACPO能有效减少冗余推理，同时根据任务复杂度自适应调整认知分配，实现高效的混合推理。

Smaller, Smarter, Closer: The Edge of Collaborative Generative AI

Abstract

arXiv:2505.16499v1 Announce Type: new Abstract: The rapid adoption of generative AI (GenAI), particularly Large Language Models (LLMs), has exposed critical limitations of cloud-centric deployments, including latency, cost, and privacy concerns. Meanwhile, Small Language Models (SLMs) are emerging as viable alternatives for resource-constrained edge environments, though they often lack the capabilities of their larger counterparts. This article explores the potential of collaborative inference systems that leverage both edge and cloud resources to address these challenges. By presenting distinct cooperation strategies alongside practical design principles and experimental insights, we offer actionable guidance for deploying GenAI across the computing continuum.

摘要

生成式人工智能（GenAI），尤其是大语言模型（LLMs）的快速普及，暴露出以云为中心部署模式的关键局限性，包括延迟、成本和隐私问题。与此同时，小语言模型（SLMs）正逐渐成为资源受限边缘环境的可行替代方案，但其能力通常不及大型模型。本文探讨了利用边缘与云计算资源的协同推理系统应对这些挑战的潜力。通过提出不同的协作策略，并结合实际设计原则与实验洞察，我们为在整个计算连续体上部署GenAI提供了可操作的指导。

Internal Bias in Reasoning Models leads to Overthinking

Abstract

arXiv:2505.16448v1 Announce Type: new Abstract: While current reasoning models possess strong exploratory capabilities, they are often criticized for overthinking due to redundant and unnecessary reflections. In this work, we reveal for the first time that overthinking in reasoning models may stem from their internal bias towards input texts. Upon encountering a reasoning problem, the model immediately forms a preliminary guess about the answer, which we term an internal bias since it is not derived through actual reasoning. When this guess conflicts with its reasoning result, the model tends to engage in reflection, leading to the waste of computational resources. Through further interpretability experiments, we find that this behavior is largely driven by the model's excessive attention to the input section, which amplifies the influence of internal bias on its decision-making process. Additionally, by masking out the original input section, the effect of internal bias can be effectively alleviated and the reasoning length can be reduced by 31%-53% across different complex reasoning tasks. Notably, in most cases, this approach also leads to improvements in accuracy. These findings demonstrate a causal relationship between internal bias and overthinking.

摘要

当前推理模型虽具备强大的探索能力，却常因冗余且不必要的反思而遭受"过度思考"的诟病。本研究首次揭示推理模型的过度思考可能源于其对输入文本的内部偏见。当面对推理问题时，模型会立即形成对答案的初步猜测——这种未经实际推理产生的预判被我们定义为内部偏见。当该猜测与推理结果冲突时，模型倾向于启动反思机制，导致计算资源浪费。通过可解释性实验发现，该行为主要源于模型对输入段的过度关注，这种关注放大了内部偏见对决策过程的影响。实验表明，通过遮蔽原始输入段可有效缓解内部偏见效应，使不同复杂推理任务中的推理长度减少31%-53%。值得注意的是，在多数情况下该方法还能提升准确率。这些发现证实了内部偏见与过度思考之间存在因果关系。

ReflectEvo: Improving Meta Introspection of Small LLMs by Learning Self-Reflection

Abstract

arXiv:2505.16475v1 Announce Type: new Abstract: We present a novel pipeline, ReflectEvo, to demonstrate that small language models (SLMs) can enhance meta introspection through reflection learning. This process iteratively generates self-reflection for self-training, fostering a continuous and self-evolving process. Leveraging this pipeline, we construct ReflectEvo-460k, a large-scale, comprehensive, self-generated reflection dataset with broadened instructions and diverse multi-domain tasks. Building upon this dataset, we demonstrate the effectiveness of reflection learning to improve SLMs' reasoning abilities using SFT and DPO with remarkable performance, substantially boosting Llama-3 from 52.4% to 71.2% and Mistral from 44.4% to 71.1%. It validates that ReflectEvo can rival or even surpass the reasoning capability of the three prominent open-sourced models on BIG-bench without distillation from superior models or fine-grained human annotation. We further conduct a deeper analysis of the high quality of self-generated reflections and their impact on error localization and correction. Our work highlights the potential of continuously enhancing the reasoning performance of SLMs through iterative reflection learning in the long run.

摘要

我们提出了一种新型流程ReflectEvo，证明小语言模型（SLMs）能通过反思学习增强元自省能力。该流程通过迭代生成自我反思进行自训练，形成持续自我进化的过程。基于此，我们构建了ReflectEvo-460k——一个大规模、综合性、自生成的反思数据集，包含扩展指令和多样化的多领域任务。利用该数据集，我们通过监督微调（SFT）和直接偏好优化（DPO）验证了反思学习对提升SLMs推理能力的显著效果：Llama-3的准确率从52.4%提升至71.2%，Mistral从44.4%提升至71.1%。这表明ReflectEvo无需依赖上级模型蒸馏或精细人工标注，即可媲美甚至超越三大知名开源模型在BIG-bench上的推理能力。我们进一步深入分析了自生成反思的高质量特性及其对错误定位与修正的影响。本研究揭示了通过迭代反思学习持续提升SLMs推理性能的长期潜力。

Advancing the Scientific Method with Large Language Models: From Hypothesis to Discovery

Abstract

arXiv:2505.16477v1 Announce Type: new Abstract: With recent Nobel Prizes recognising AI contributions to science, Large Language Models (LLMs) are transforming scientific research by enhancing productivity and reshaping the scientific method. LLMs are now involved in experimental design, data analysis, and workflows, particularly in chemistry and biology. However, challenges such as hallucinations and reliability persist. In this contribution, we review how Large Language Models (LLMs) are redefining the scientific method and explore their potential applications across different stages of the scientific cycle, from hypothesis testing to discovery. We conclude that, for LLMs to serve as relevant and effective creative engines and productivity enhancers, their deep integration into all steps of the scientific process should be pursued in collaboration and alignment with human scientific goals, with clear evaluation metrics. The transition to AI-driven science raises ethical questions about creativity, oversight, and responsibility. With careful guidance, LLMs could evolve into creative engines, driving transformative breakthroughs across scientific disciplines responsibly and effectively. However, the scientific community must also decide how much it leaves to LLMs to drive science, even when associations with 'reasoning', mostly currently undeserved, are made in exchange for the potential to explore hypothesis and solution regions that might otherwise remain unexplored by human exploration alone.

摘要

随着近年诺贝尔奖对人工智能科学贡献的认可，大语言模型（LLMs）正通过提升生产力和重塑科研方法变革科学研究。当前LLMs已参与化学、生物学等领域的实验设计、数据分析和工作流程，但仍存在幻觉与可靠性等挑战。本文系统评述了大语言模型如何重新定义科学方法，并探讨其在假设检验到科学发现等科研周期各阶段的潜在应用。我们得出结论：要使LLMs成为相关且高效的创意引擎与生产力增强工具，需通过明确评估指标，使其深度融入科研全流程并与人类科学目标协同。向AI驱动科学的转型引发了关于创造性、监督与责任的伦理问题。在审慎引导下，LLMs或可发展为负责任且高效的创意引擎，推动跨学科突破性进展。但科学界仍需权衡：即便在探索人类单独研究可能无法触及的假设与解决方案领域时，当以尚不成熟的"推理"能力为交换条件，究竟应让LLMs在多大程度上主导科研进程。

FREESON: Retriever-Free Retrieval-Augmented Reasoning via Corpus-Traversing MCTS

Abstract

arXiv:2505.16409v1 Announce Type: new Abstract: Large Reasoning Models (LRMs) have demonstrated remarkable capabilities in multi-step reasoning and calling search engines at appropriate steps. However, existing retrieval-augmented reasoning approaches rely on separate retrieval models, limiting the LRM's role in retrieval to deciding when to retrieve and how to query. This separation not only increases hardware and operational costs but also leads to errors in the retrieval process due to the representation bottleneck, a phenomenon where the retriever's embedding space is not expressive enough to meet the generator's requirements. To address this, we shift our perspective from sequence-to-sequence matching to locating the answer-containing paths within the corpus, and propose a novel framework called FREESON (Retriever-FREE Retrieval-Augmented ReaSONing). This framework enables LRMs to retrieve relevant knowledge on their own by acting as both a generator and retriever. To achieve this, we introduce a variant of the MCTS algorithm specialized for the retrieval task, which we call CT-MCTS (Corpus-Traversing Monte Carlo Tree Search). In this algorithm, LRMs traverse through the corpus toward answer-containing regions. Our results on five open-domain QA benchmarks, including single-hop and multi-hop questions, show that FREESON achieves an average improvement of 14.4% in EM and F1 over four multi-step reasoning models with a separate retriever, and it also performs comparably to the strongest baseline, surpassing it by 3% on PopQA and 2WikiMultihopQA.

摘要

大型推理模型（LRMs）在多步推理和适时调用搜索引擎方面展现出卓越能力。然而现有检索增强推理方法依赖独立的检索模型，将LRMs在检索中的作用局限于决定检索时机与查询方式。这种分离不仅增加了硬件与运维成本，更因表征瓶颈现象（即检索器嵌入空间无法充分满足生成器需求）导致检索过程出现误差。为此，我们突破序列到序列匹配的范式，转向在语料库中定位包含答案的路径，提出名为FREESON（无检索器的检索增强推理）的新型框架。该框架通过使LRMs兼具生成器与检索器功能，实现自主检索相关知识。为此，我们提出专用于检索任务的MCTS算法变体——CT-MCTS（语料库遍历蒙特卡洛树搜索），在该算法中LRMs沿语料库向答案所在区域进行遍历。在五个开放域QA基准（含单跳与多跳问题）上的实验表明：相较于四个配备独立检索器的多步推理模型，FREESON在EM和F1指标上平均提升14.4%；其性能与最强基线相当，并在PopQA和2WikiMultihopQA上均超出后者3%。

Edge-First Language Model Inference: Models, Metrics, and Tradeoffs

Abstract

arXiv:2505.16508v1 Announce Type: new Abstract: The widespread adoption of Language Models (LMs) across industries is driving interest in deploying these services across the computing continuum, from the cloud to the network edge. This shift aims to reduce costs, lower latency, and improve reliability and privacy. Small Language Models (SLMs), enabled by advances in model compression, are central to this shift, offering a path to on-device inference on resource-constrained edge platforms. This work examines the interplay between edge and cloud deployments, starting from detailed benchmarking of SLM capabilities on single edge devices, and extending to distributed edge clusters. We identify scenarios where edge inference offers comparable performance with lower costs, and others where cloud fallback becomes essential due to limits in scalability or model capacity. Rather than proposing a one-size-fits-all solution, we present platform-level comparisons and design insights for building efficient, adaptive LM inference systems across heterogeneous environments.

摘要

语言模型（LMs）在各行业的广泛应用推动了人们将其服务部署于从云端到网络边缘的整个计算连续体中的兴趣。这一转变旨在降低成本、减少延迟，并提升可靠性和隐私性。得益于模型压缩技术的进步，小型语言模型（SLMs）成为这一转变的核心，为资源受限的边缘平台提供了设备端推理的途径。本研究探讨了边缘与云端部署之间的相互作用，从单一边缘设备上SLM能力的详细基准测试出发，延伸至分布式边缘集群。我们识别了边缘推理在性能相当且成本更低时的适用场景，以及由于可扩展性或模型容量限制而必须依赖云端回退的其他场景。我们并未提出一刀切的解决方案，而是提供了平台级比较和设计见解，以构建跨异构环境的高效、自适应LM推理系统。

Abstract

arXiv:2505.16459v1 Announce Type: new Abstract: Recent advances in Multi-Modal Large Language Models (MLLMs) have enabled unified processing of language, vision, and structured inputs, opening the door to complex tasks such as logical deduction, spatial reasoning, and scientific analysis. Despite their promise, the reasoning capabilities of MLLMs, particularly those augmented with intermediate thinking traces (MLLMs-T), remain poorly understood and lack standardized evaluation benchmarks. Existing work focuses primarily on perception or final answer correctness, offering limited insight into how models reason or fail across modalities. To address this gap, we introduce the MMMR, a new benchmark designed to rigorously evaluate multi-modal reasoning with explicit thinking. The MMMR comprises 1) a high-difficulty dataset of 1,083 questions spanning six diverse reasoning types with symbolic depth and multi-hop demands and 2) a modular Reasoning Trace Evaluation Pipeline (RTEP) for assessing reasoning quality beyond accuracy through metrics like relevance, consistency, and structured error annotations. Empirical results show that MLLMs-T overall outperform non-thinking counterparts, but even top models like Claude-3.7-Sonnet and Gemini-2.5 Pro suffer from reasoning pathologies such as inconsistency and overthinking. This benchmark reveals persistent gaps between accuracy and reasoning quality and provides an actionable evaluation pipeline for future model development. Overall, the MMMR offers a scalable foundation for evaluating, comparing, and improving the next generation of multi-modal reasoning systems.

摘要

多模态大语言模型(MLLMs)的最新进展实现了对语言、视觉和结构化输入的统一处理，为逻辑推理、空间推理和科学分析等复杂任务开辟了道路。尽管前景广阔，但MLLMs（特别是增强中间思维轨迹的MLLMs-T）的推理能力仍未被充分理解，且缺乏标准化评估基准。现有研究主要关注感知或最终答案的正确性，对模型跨模态的推理过程或失败原因提供有限洞察。为填补这一空白，我们提出了MMMR基准——一个专门用于严格评估显性思维多模态推理的新基准。该基准包含：1) 一个高难度数据集，涵盖六种具有符号深度和多跳需求的多样化推理类型，共1,083个问题；2) 模块化推理轨迹评估管道(RTEP)，通过相关性、一致性和结构化错误标注等指标，超越准确率评估推理质量。实验结果表明，MLLMs-T总体优于非思维增强模型，但即使是Claude-3.7-Sonnet和Gemini-2.5 Pro等顶级模型仍存在不一致性和过度思考等推理缺陷。该基准揭示了准确率与推理质量之间的持续差距，并为未来模型开发提供了可操作的评估框架。总体而言，MMMR为评估、比较和改进下一代多模态推理系统提供了可扩展的基础。

Recursive Offloading for LLM Serving in Multi-tier Networks

Abstract

arXiv:2505.16502v1 Announce Type: new Abstract: Heterogeneous device-edge-cloud computing infrastructures have become widely adopted in telecommunication operators and Wide Area Networks (WANs), offering multi-tier computational support for emerging intelligent services. With the rapid proliferation of Large Language Model (LLM) services, efficiently coordinating inference tasks and reducing communication overhead within these multi-tier network architectures becomes a critical deployment challenge. Existing LLM serving paradigms exhibit significant limitations: on-device deployment supports only lightweight LLMs due to hardware constraints, while cloud-centric deployment suffers from resource congestion and considerable prompt communication overhead caused by frequent service requests during peak periods. Although the model-cascading-based inference strategy adapts better to multi-tier networks, its reliance on fine-grained, manually adjusted thresholds makes it less responsive to dynamic network conditions and varying task complexities. To address these challenges, we propose RecServe, a recursive offloading framework tailored for LLM serving in multi-tier networks. RecServe integrates a task-specific hierarchical confidence evaluation mechanism that guides offloading decisions based on inferred task complexity in progressively scaled LLMs across device, edge, and cloud tiers. To further enable intelligent task routing across tiers, RecServe employs a sliding-window-based dynamic offloading strategy with quantile interpolation, enabling real-time tracking of historical confidence distributions and adaptive offloading threshold adjustments. Experiments on eight datasets demonstrate that RecServe outperforms CasServe in both service quality and communication efficiency, and reduces the communication burden by over 50% compared to centralized cloud-based serving.

摘要

异构设备-边缘-云计算基础设施已在电信运营商和广域网(WAN)中得到广泛应用，为新兴智能服务提供多层次计算支持。随着大语言模型(LLM)服务的快速普及，如何在这种多层网络架构中高效协调推理任务并降低通信开销成为关键部署挑战。现有LLM服务范式存在显著局限：受硬件限制，设备端部署仅支持轻量级LLM；而以云为中心的部署则面临资源拥塞和高峰时段频繁服务请求导致的巨大提示词通信开销。虽然基于模型级联的推理策略更适应多层网络，但其依赖细粒度人工调整阈值的方式难以响应动态网络条件和多变任务复杂度。为此，我们提出RecServe——一个专为多层网络LLM服务设计的递归卸载框架。该框架整合了面向任务的分层置信度评估机制，通过跨设备、边缘和云层级逐步扩展的LLM来推断任务复杂度，从而指导卸载决策。为进一步实现跨层级智能任务路由，RecServe采用基于滑动窗口的分位数插值动态卸载策略，实时追踪历史置信度分布并自适应调整卸载阈值。在八个数据集上的实验表明，RecServe在服务质量和通信效率上均优于CasServe，相比集中式云服务可降低50%以上的通信负担。
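
The sliding-window strategy with quantile interpolation described above can be illustrated with a short Python sketch. It assumes each tier produces a scalar confidence per request and offloads to the next tier when that confidence falls below a quantile of recent confidences. The class name `SlidingThreshold`, the default window size, the quantile `q`, and the `initial` threshold are all hypothetical choices, not RecServe's actual parameters.

```python
from collections import deque


def quantile(values, q):
    """Quantile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


class SlidingThreshold:
    """Tracks recent confidences and derives an adaptive offloading threshold."""

    def __init__(self, window=100, q=0.3, initial=0.5):
        self.history = deque(maxlen=window)  # oldest entries fall out automatically
        self.q = q
        self.initial = initial  # used until any history exists

    def threshold(self):
        if not self.history:
            return self.initial
        return quantile(self.history, self.q)

    def should_offload(self, confidence):
        """Decide against the current threshold, then record the confidence."""
        decision = confidence < self.threshold()
        self.history.append(confidence)
        return decision


tier = SlidingThreshold(window=4, q=0.5)
decisions = [tier.should_offload(c) for c in (0.9, 0.1, 0.6)]
print(decisions, tier.threshold())
```

Because the threshold follows the recent confidence distribution, no per-task manual tuning is needed in this sketch; only the quantile `q` fixes roughly what share of requests is passed up to the next tier.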

Is Your LLM-Based Multi-Agent a Reliable Real-World Planner? Exploring Fraud Detection in Travel Planning

Abstract

arXiv:2505.16557v1 Announce Type: new Abstract: The rise of Large Language Model-based Multi-Agent Planning has leveraged advanced frameworks to enable autonomous and collaborative task execution. Some systems rely on platforms like review sites and social media, which are prone to fraudulent information, such as fake reviews or misleading descriptions. This reliance poses risks, potentially causing financial losses and harming user experiences. To evaluate the risk of planning systems in real-world applications, we introduce WandaPlan, an evaluation environment mirroring real-world data and injected with deceptive content. We assess system performance across three fraud cases: Misinformation Fraud, Team-Coordinated Multi-Person Fraud, and Level-Escalating Multi-Round Fraud. We reveal significant weaknesses in existing frameworks that prioritize task efficiency over data authenticity. At the same time, we validate WandaPlan's generalizability, capable of assessing the risks of real-world open-source planning frameworks. To mitigate the risk of fraud, we propose integrating an anti-fraud agent, providing a solution for reliable planning.

摘要

基于大语言模型的多智能体规划系统的兴起，利用先进框架实现了自主协作的任务执行。现有系统多依赖点评网站和社交媒体等易受欺诈信息（如虚假评论或误导性描述）影响的平台，这种依赖性可能引发财务损失和损害用户体验的风险。为评估规划系统在现实应用中的风险，我们提出WandaPlan评估环境，该环境模拟真实数据并注入欺骗性内容。我们通过三类欺诈案例（虚假信息欺诈、团队协作多人欺诈、层级递进多轮欺诈）评估系统性能，发现现有框架因优先考虑任务效率而忽视数据真实性存在重大缺陷。同时验证了WandaPlan的泛化能力，可有效评估现实开源规划框架的风险。为降低欺诈风险，我们提出集成反欺诈智能体的方案，为可靠规划提供解决路径。

Bridging the Dynamic Perception Gap: Training-Free Draft Chain-of-Thought for Dynamic Multimodal Spatial Reasoning

Abstract

arXiv:2505.16579v1 Announce Type: new Abstract: While chains-of-thought (CoT) have advanced complex reasoning in multimodal large language models (MLLMs), existing methods remain confined to text or static visual domains, often faltering in dynamic spatial reasoning tasks. To bridge this gap, we present GRASSLAND, a novel maze navigation benchmark designed to evaluate dynamic spatial reasoning. Our experiments show that augmenting textual reasoning chains with dynamic visual drafts, overlaid on input images, significantly outperforms conventional approaches, offering new insights into spatial reasoning in evolving environments. To generalize this capability, we propose D2R (Dynamic Draft-Augmented Reasoning), a training-free framework that seamlessly integrates textual CoT with corresponding visual drafts into MLLMs. Extensive evaluations demonstrate that D2R consistently enhances performance across diverse tasks, establishing a robust baseline for dynamic spatial reasoning without requiring model fine-tuning. Project is open at https://github.com/Cratileo/D2R.

摘要

尽管思维链（CoT）技术在多模态大语言模型（MLLMs）中推动了复杂推理的发展，但现有方法仍局限于文本或静态视觉领域，在动态空间推理任务中往往表现不佳。为弥补这一不足，我们提出了GRASSLAND——一个专为评估动态空间推理而设计的新型迷宫导航基准测试。实验表明，通过在输入图像上叠加动态视觉草图来增强文本推理链，能显著超越传统方法，为动态环境中的空间推理提供了新见解。为推广这一能力，我们提出D2R（动态草图增强推理），这是一种免训练框架，可将文本CoT与相应视觉草图无缝集成到MLLMs中。大量评估证明，D2R能持续提升各类任务的性能，在不需模型微调的情况下为动态空间推理建立了稳健的基准。项目开源地址：https://github.com/Cratileo/D2R。

SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving

Abstract

arXiv:2505.16646v1 Announce Type: new Abstract: Large Language Models have achieved remarkable results on a variety of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine mathematical reasoning or superficial pattern recognition. Common evaluation metrics, such as final answer accuracy, fail to disentangle the underlying competencies involved, offering limited diagnostic value. To address these limitations, we introduce SMART: a Self-Generating and Self-Validating Multi-Dimensional Assessment Framework. SMART decomposes mathematical problem solving into four distinct dimensions: understanding, reasoning, arithmetic, and reflection & refinement. Each dimension is evaluated independently through tailored tasks, enabling interpretable and fine-grained analysis of LLM behavior. Crucially, SMART integrates an automated self-generating and self-validating mechanism to produce and verify benchmark data, ensuring both scalability and reliability. We apply SMART to 21 state-of-the-art open- and closed-source LLMs, uncovering significant discrepancies in their abilities across different dimensions. Our findings demonstrate the inadequacy of final answer accuracy as a sole metric and motivate a new holistic metric to better capture true problem-solving capabilities. Code and benchmarks will be released upon acceptance.

摘要

大语言模型在各类数学基准测试中取得了显著成果。然而，这些成功究竟反映真实的数学推理能力还是表面的模式识别，仍存疑虑。现有常用评估指标（如最终答案准确率）无法区分潜在的核心能力要素，诊断价值有限。为此，我们提出SMART框架：一种自生成自验证的多维评估体系。该框架将数学问题解决分解为四个独立维度——理解、推理、算术以及反思与优化，通过定制化任务对各维度进行独立评估，从而实现对大语言模型行为可解释、细粒度的分析。关键创新在于整合了自动化自生成与自验证机制来生产并校验基准数据，确保评估的可扩展性与可靠性。我们对21个最先进的开源与闭源大语言模型进行测试，发现不同维度能力存在显著差异。研究结果证明仅凭最终答案准确率作为单一指标的不足，并推动建立新的综合评价指标以更准确捕捉真实问题解决能力。代码与基准测试数据将在论文录用后公开发布。

ELABORATION: A Comprehensive Benchmark on Human-LLM Competitive Programming

Abstract

arXiv:2505.16667v1 Announce Type: new Abstract: While recent research increasingly emphasizes the value of human-LLM collaboration in competitive programming and proposes numerous empirical methods, a comprehensive understanding remains elusive due to the fragmented nature of existing studies and their use of diverse, application-specific human feedback. Thus, our work serves a three-fold purpose: First, we present the first taxonomy of human feedback consolidating the entire programming process, which promotes fine-grained evaluation. Second, we introduce ELABORATIONSET, a novel programming dataset specifically designed for human-LLM collaboration, meticulously annotated to enable large-scale simulated human feedback and facilitate cost-effective real human interaction studies. Third, we introduce ELABORATION, a novel benchmark to facilitate a thorough assessment of human-LLM competitive programming. With ELABORATION, we pinpoint strengths and weaknesses of existing methods, thereby setting the foundation for future improvement. Our code and dataset are available at https://github.com/SCUNLP/ELABORATION

摘要

尽管近期研究日益强调人类与大型语言模型（LLM）在竞技编程中协作的价值，并提出了多种实证方法，但由于现有研究呈现碎片化特征且采用多样化的应用特定人类反馈，全面理解仍显不足。为此，本研究实现三重目标：首先，我们提出首个整合完整编程流程的人类反馈分类体系，支持细粒度评估。其次，我们推出ELABORATIONSET——一个专为人类-LLM协作设计的新型编程数据集，通过精细标注支持大规模模拟人类反馈，并为经济高效的真人交互研究提供基础。第三，我们建立ELABORATION基准测试，以系统评估人类-LLM竞技编程表现。借助该基准，我们精准识别现有方法的优势与不足，为未来改进奠定基础。代码与数据集详见https://github.com/SCUNLP/ELABORATION。

Data-Driven Breakthroughs and Future Directions in AI Infrastructure: A Comprehensive Review

Abstract

arXiv:2505.16771v1 Announce Type: new Abstract: This paper presents a comprehensive synthesis of major breakthroughs in artificial intelligence (AI) over the past fifteen years, integrating historical, theoretical, and technological perspectives. It identifies key inflection points in AI's evolution by tracing the convergence of computational resources, data access, and algorithmic innovation. The analysis highlights how researchers enabled GPU-based model training, triggered a data-centric shift with ImageNet, simplified architectures through the Transformer, and expanded modeling capabilities with the GPT series. Rather than treating these advances as isolated milestones, the paper frames them as indicators of deeper paradigm shifts. By applying concepts from statistical learning theory such as sample complexity and data efficiency, the paper explains how researchers translated breakthroughs into scalable solutions and why the field must now embrace data-centric approaches. In response to rising privacy concerns and tightening regulations, the paper evaluates emerging solutions like federated learning, privacy-enhancing technologies (PETs), and the data site paradigm, which reframe data access and security. In cases where real-world data remains inaccessible, the paper also assesses the utility and constraints of mock and synthetic data generation. By aligning technical insights with evolving data infrastructure, this study offers strategic guidance for future AI research and policy development.

摘要

本文对过去十五年间人工智能(AI)领域的重大突破进行了全面综合，整合了历史、理论和技术的多维视角。通过追踪计算资源、数据获取与算法创新的融合轨迹，研究界定了AI演进过程中的关键转折点。分析着重阐释了研究者如何实现基于GPU的模型训练、通过ImageNet引发以数据为中心的范式转移、借助Transformer简化架构，以及利用GPT系列拓展建模能力。论文并未将这些进展视为孤立里程碑，而是将其作为深层范式转变的指示标。通过运用统计学习理论中的样本复杂度和数据效率等概念，研究揭示了突破性成果如何转化为可扩展解决方案，并阐明了该领域为何必须转向以数据为中心的方法。针对日益增长的隐私顾虑与监管收紧，论文评估了联邦学习、隐私增强技术(PETs)以及重构数据访问与安全的数据站点范式等新兴解决方案。对于现实数据难以获取的场景，研究还评估了模拟与合成数据生成的效用与限制。通过将技术洞见与演进中的数据基础设施相衔接，本研究为未来AI研究与政策制定提供了战略指引。

MCP-RADAR: A Multi-Dimensional Benchmark for Evaluating Tool Use Capabilities in Large Language Models

Abstract

arXiv:2505.16700v1 Announce Type: new Abstract: As Large Language Models (LLMs) evolve from passive text generators to active reasoning agents capable of tool interaction, the Model Context Protocol (MCP) has emerged as a standardized framework for dynamic tool discovery and orchestration. Despite widespread industry adoption, existing evaluation methodologies fail to adequately assess tool utilization capabilities within this new paradigm. This paper introduces MCP-RADAR, the first comprehensive benchmark specifically designed to evaluate LLM performance in the MCP framework through a novel five-dimensional approach measuring: answer accuracy, tool selection efficiency, computational resource efficiency, parameter construction accuracy, and execution speed. Unlike conventional benchmarks that rely on subjective human evaluations or binary success metrics, MCP-RADAR employs objective, quantifiable measurements across multiple task domains including software engineering, mathematical reasoning, and general problem-solving. Our evaluations of leading commercial and open-source LLMs reveal distinctive capability profiles with significant trade-offs between accuracy, efficiency, and speed, challenging traditional single-metric performance rankings. Besides, we provide valuable guidance for developers to optimize their tools for maximum model compatibility and effectiveness. While focused on MCP due to its standardized approach, our methodology remains applicable across all LLM agent tool integration frameworks, providing valuable insights for both LLM developers and tool creators to optimize the entire LLM-tool interaction ecosystem. The implementation, configurations, and datasets used in our evaluation are publicly available at https://anonymous.4open.science/r/MCPRadar-B143.

摘要

随着大型语言模型(LLMs)从被动文本生成器发展为具备工具交互能力的主动推理智能体，模型上下文协议(MCP)已成为动态工具发现与编排的标准化框架。尽管该框架已在工业界广泛应用，现有评估方法仍无法充分衡量这一新范式下的工具利用能力。本文提出首个专为MCP框架设计的综合基准测试MCP-RADAR，通过创新性的五维评估体系进行性能度量：答案准确性、工具选择效率、计算资源效率、参数构建准确性和执行速度。与传统依赖主观人工评估或二元成功指标的基准不同，MCP-RADAR采用客观量化指标，覆盖软件工程、数学推理和通用问题求解等多任务领域。我们对主流商业及开源LLMs的评估揭示了各模型在准确性、效率与速度之间存在显著权衡的独特能力特征，这对传统单一指标性能排名提出了挑战。此外，我们为开发者提供了优化工具以实现最大模型兼容性和有效性的实用指南。虽然研究聚焦于标准化的MCP框架，但该方法论可适用于所有LLM智能体工具集成框架，为LLM开发者和工具创建者优化整体交互生态系统提供了重要参考。评估所用的实现方案、配置及数据集已公开于https://anonymous.4open.science/r/MCPRadar-B143。

KTAE: A Model-Free Algorithm to Key-Tokens Advantage Estimation in Mathematical Reasoning

Abstract

arXiv:2505.16826v1 Announce Type: new Abstract: Recent advances have demonstrated that integrating reinforcement learning with rule-based rewards can significantly enhance the reasoning capabilities of large language models, even without supervised fine-tuning. However, prevalent reinforcement learning algorithms such as GRPO and its variants like DAPO suffer from a coarse granularity issue when computing the advantage. Specifically, they compute rollout-level advantages that assign identical values to every token within a sequence, failing to capture token-specific contributions and hindering effective learning. To address this limitation, we propose Key-token Advantage Estimation (KTAE), a novel algorithm that estimates fine-grained, token-level advantages without introducing additional models. KTAE leverages the correctness of sampled rollouts and applies statistical analysis to quantify the importance of individual tokens within a sequence to the final outcome. This quantified token-level importance is then combined with the rollout-level advantage to obtain a more fine-grained token-level advantage estimation. Empirical results show that models trained with GRPO+KTAE and DAPO+KTAE outperform baseline methods across five mathematical reasoning benchmarks. Notably, they achieve higher accuracy with shorter responses and even surpass R1-Distill-Qwen-1.5B using the same base model.

摘要

近期研究表明，将强化学习与基于规则的奖励相结合，即使无需监督微调，也能显著增强大语言模型的推理能力。然而，当前主流强化学习算法如GRPO及其变体DAPO在计算优势值时存在粒度粗放问题。具体而言，这些算法通过序列级优势计算为同一序列中的所有标记分配相同值，无法捕捉标记级贡献，从而阻碍有效学习。为突破这一局限，我们提出关键标记优势估计（KTAE）——一种无需引入额外模型即可实现细粒度标记级优势估计的新算法。KTAE通过统计分析方法，利用采样序列的正确性量化序列中单个标记对最终结果的贡献度，并将该量化结果与序列级优势值结合，获得更精细的标记级优势估计。实验结果表明，采用GRPO+KTAE和DAPO+KTAE训练的模型在五项数学推理基准测试中均超越基线方法。值得注意的是，这些模型能以更短的响应长度实现更高准确率，甚至在使用相同基础模型时超越R1-Distill-Qwen-1.5B。
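
To make the contrast between rollout-level and token-level advantages concrete, here is a rough Python sketch. The abstract does not say which statistic KTAE uses or how it combines the two quantities, so everything below is an assumption made for illustration: importance is taken as the difference between a token's occurrence rate in correct and in incorrect rollouts, and the combination is a weighted sum with a hypothetical `weight`. It should not be read as the KTAE algorithm.

```python
from statistics import mean, pstdev


def rollout_advantages(rewards):
    """Group-normalised advantage per rollout: (r - mean) / std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    if sigma == 0:
        return [0.0 for _ in rewards]  # all rollouts scored the same
    return [(r - mu) / sigma for r in rewards]


def token_importance(rollouts, correct):
    """Assumed statistic: occurrence rate in correct minus incorrect rollouts."""
    n_pos = sum(correct)
    n_neg = len(correct) - n_pos
    scores = {}
    for tok in {t for r in rollouts for t in r}:
        pos = sum(1 for r, c in zip(rollouts, correct) if c and tok in r)
        neg = sum(1 for r, c in zip(rollouts, correct) if not c and tok in r)
        scores[tok] = (pos / n_pos if n_pos else 0.0) - (neg / n_neg if n_neg else 0.0)
    return scores


def token_advantages(rollouts, correct, weight=0.5):
    """Assumed combination: rollout advantage plus weighted token importance."""
    adv = rollout_advantages([1.0 if c else 0.0 for c in correct])
    imp = token_importance(rollouts, correct)
    return [[a + weight * imp[t] for t in r] for r, a in zip(rollouts, adv)]


rollouts = [["a", "b"], ["a", "c"]]  # toy token sequences
correct = [True, False]              # rule-based correctness of each rollout
print(token_advantages(rollouts, correct))
```

In this toy case the shared token "a" keeps the plain rollout-level value, while "b" and "c", which appear only in the correct or only in the incorrect rollout, are pushed further apart. That is the kind of token-specific signal a purely rollout-level advantage cannot express.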

Beyond Correlation: Towards Causal Large Language Model Agents in Biomedicine

Abstract

arXiv:2505.16982v1 Announce Type: new Abstract: Large Language Models (LLMs) show promise in biomedicine but lack true causal understanding, relying instead on correlations. This paper envisions causal LLM agents that integrate multimodal data (text, images, genomics, etc.) and perform intervention-based reasoning to infer cause-and-effect. Addressing this requires overcoming key challenges: designing safe, controllable agentic frameworks; developing rigorous benchmarks for causal evaluation; integrating heterogeneous data sources; and synergistically combining LLMs with structured knowledge (KGs) and formal causal inference tools. Such agents could unlock transformative opportunities, including accelerating drug discovery through automated hypothesis generation and simulation, enabling personalized medicine through patient-specific causal models. This research agenda aims to foster interdisciplinary efforts, bridging causal concepts and foundation models to develop reliable AI partners for biomedical progress.

摘要

大型语言模型（LLMs）在生物医学领域展现出潜力，但其依赖相关性而非真正的因果理解。本文提出构建因果型LLM智能体的愿景，通过整合多模态数据（文本、图像、基因组学等）并进行基于干预的推理来实现因果关系推断。实现这一目标需攻克以下关键挑战：设计安全可控的智能体框架、建立严格的因果评估基准、融合异构数据源，以及协同结合LLMs与结构化知识图谱（KGs）和形式化因果推理工具。此类智能体有望开启变革性机遇，包括通过自动化假设生成与模拟加速药物发现、基于患者特异性因果模型实现个性化医疗。本研究议程旨在促进跨学科合作，衔接因果概念与基础模型，为生物医学进步开发可靠的人工智能合作伙伴。

Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models

Abstract

arXiv:2505.16854v1 Announce Type: new Abstract: Reinforcement Learning (RL) has proven to be an effective post-training strategy for enhancing reasoning in vision-language models (VLMs). Group Relative Policy Optimization (GRPO) is a recent prominent method that encourages models to generate complete reasoning traces before answering, leading to increased token usage and computational cost. Inspired by the human-like thinking process, in which people skip reasoning for easy questions but think carefully when needed, we explore how to enable VLMs to first decide when reasoning is necessary. To realize this, we propose TON, a two-stage training strategy: (i) a supervised fine-tuning (SFT) stage with a simple yet effective 'thought dropout' operation, where reasoning traces are randomly replaced with empty thoughts, which introduces a think-or-not format that serves as a cold start for selective reasoning; (ii) a GRPO stage that enables the model to freely explore when to think or not, while maximizing task-aware outcome rewards. Experimental results show that TON can reduce the completion length by up to 90% compared to vanilla GRPO without sacrificing performance, and in some cases even improves it. Further evaluations across diverse vision-language tasks, covering a range of reasoning difficulties under both 3B and 7B models, consistently reveal that the model progressively learns to bypass unnecessary reasoning steps as training advances. These findings shed light on the path toward human-like reasoning patterns in reinforcement learning approaches. Our code is available at https://github.com/kokolerk/TON.

摘要

强化学习（RL）已被证明是一种有效的后训练策略，可增强视觉语言模型（VLM）的推理能力。组相对策略优化（GRPO）是近期的一种重要方法，它鼓励模型在回答前生成完整的推理轨迹，但这会导致标记使用量和计算成本增加。受人类思维过程的启发——人们在简单问题上跳过推理，而在需要时仔细思考——我们探索如何让VLM首先决定何时需要推理。为实现这一目标，我们提出了TON，一种两阶段训练策略：（i）监督微调（SFT）阶段，采用简单而有效的“思维丢弃”操作，随机将推理轨迹替换为空思维。这引入了一种“思考与否”的格式，为选择性推理提供了冷启动；（ii）GRPO阶段，使模型能够自由探索何时思考或不思考，同时最大化任务感知的结果奖励。实验结果表明，与原始GRPO相比，TON可将完成长度减少高达90%，且不会牺牲性能甚至有所提升。在多种视觉语言任务（涵盖3B和7B模型下不同推理难度）的进一步评估中，一致发现模型随着训练的推进逐渐学会跳过不必要的推理步骤。这些发现为强化学习方法中实现类人推理模式提供了启示。我们的代码可在https://github.com/kokolerk/TON获取。
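
The 'thought dropout' operation in stage (i) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the released TON code: the `<think>` tags, the field names `thought` and `answer`, and the `drop_prob` parameter are hypothetical.

```python
import random


def thought_dropout(example, drop_prob, rng):
    """Return an SFT target whose reasoning trace is emptied with probability drop_prob."""
    # Assumption: the target wraps reasoning in <think>...</think> before the answer.
    thought = "" if rng.random() < drop_prob else example["thought"]
    return f"<think>{thought}</think>{example['answer']}"


rng = random.Random(0)
example = {"thought": "2 plus 2 equals 4.", "answer": "4"}

always_dropped = thought_dropout(example, drop_prob=1.0, rng=rng)
never_dropped = thought_dropout(example, drop_prob=0.0, rng=rng)
```

Because the empty-thought target keeps the same outer format as the full one, a model trained on the mixture sees both "think" and "skip" as valid outputs, which is what lets the later GRPO stage choose between them.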

HyGenar: An LLM-Driven Hybrid Genetic Algorithm for Few-Shot Grammar Generation

Abstract

arXiv:2505.16978v1 Announce Type: new Abstract: Grammar plays a critical role in natural language processing and text/code generation by enabling the definition of syntax, the creation of parsers, and the guidance of structured outputs. Although large language models (LLMs) demonstrate impressive capabilities across domains, their ability to infer and generate grammars has not yet been thoroughly explored. In this paper, we aim to study and improve the ability of LLMs for few-shot grammar generation, where grammars are inferred from a small set of positive and negative examples and generated in Backus-Naur Form. To explore this, we introduce a novel dataset comprising 540 structured grammar generation challenges, devise 6 metrics, and evaluate 8 LLMs against it. Our findings reveal that existing LLMs perform sub-optimally in grammar generation. To address this, we propose an LLM-driven hybrid genetic algorithm, namely HyGenar, to optimize grammar generation. HyGenar achieves substantial improvements in both the syntactic and semantic correctness of generated grammars across LLMs.

摘要

语法在自然语言处理和文本/代码生成中具有关键作用，它能够定义句法结构、创建解析器并指导结构化输出。尽管大语言模型（LLMs）在各领域展现出卓越能力，但其推断和生成语法的能力尚未得到深入探索。本文旨在研究并提升LLMs在小样本语法生成中的能力，即从少量正负示例中推断语法并以巴科斯-诺尔范式生成。为此，我们构建了一个包含540项结构化语法生成挑战的新数据集，设计了6项评估指标，并对8种不同LLMs进行了系统测试。研究发现现有LLMs在语法生成任务中表现欠佳。针对此问题，我们提出了一种基于LLM的混合遗传算法HyGenar来优化语法生成。实验表明，HyGenar能显著提升跨模型生成语法在句法和语义层面的正确性。

Know the Ropes: A Heuristic Strategy for LLM-based Multi-Agent System Design

Abstract

arXiv:2505.16979v1 Announce Type: new Abstract: Single-agent LLMs hit hard limits--finite context, role overload, and brittle domain transfer. Conventional multi-agent fixes soften those edges yet expose fresh pains: ill-posed decompositions, fuzzy contracts, and verification overhead that blunts the gains. We therefore present Know-The-Ropes (KtR), a framework that converts domain priors into an algorithmic blueprint hierarchy, in which tasks are recursively split into typed, controller-mediated subtasks, each solved zero-shot or with the lightest viable boost (e.g., chain-of-thought, micro-tune, self-check). Grounded in the No-Free-Lunch theorem, KtR trades the chase for a universal prompt for disciplined decomposition. On the Knapsack problem (3-8 items), three GPT-4o-mini agents raise accuracy from 3% zero-shot to 95% on size-5 instances after patching a single bottleneck agent. On the tougher Task-Assignment problem (6-15 jobs), a six-agent o3-mini blueprint hits 100% up to size 10 and 84% on sizes 13-15, versus 11% zero-shot. Algorithm-aware decomposition plus targeted augmentation thus turns modest models into reliable collaborators--no ever-larger monoliths required.

摘要

单智能体大语言模型面临三大瓶颈：有限上下文容量、角色过载和脆弱的领域迁移能力。传统多智能体方案虽能缓解这些问题，却引入了新痛点：任务分解失当、合约定义模糊以及验证开销抵消性能增益。为此，我们提出Know-The-Ropes（KtR）框架，将领域先验转化为算法蓝图层级结构，通过递归分解为类型化、控制器协调的子任务，采用零样本或最小可行增强策略（如思维链、微调、自检）求解。基于"没有免费午餐"定理，KtR摒弃通用提示词的追求，转向结构化任务分解。在背包问题（3-8物品）中，三个GPT-4o-mini智能体通过修补单个瓶颈节点，将5物品实例的准确率从零样本的3%提升至95%。在更复杂的任务分配问题（6-15作业）中，六智能体o3-mini蓝图在10作业规模实现100%准确率，13-15作业规模达84%，远超零样本11%的表现。算法感知分解结合精准增强，使中等模型即可成为可靠协作体——无需持续堆砌巨型单体模型。

Problem-Solving Logic Guided Curriculum In-Context Learning for LLMs Complex Reasoning

Abstract

arXiv:2502.15401v1 Announce Type: cross Abstract: In-context learning (ICL) can significantly enhance the complex reasoning capabilities of large language models (LLMs), with the key lying in the selection and ordering of demonstration examples. Previous methods typically relied on simple features to measure the relevance between examples. We argue that these features are not sufficient to reflect the intrinsic connections between examples. In this study, we propose a curriculum ICL strategy guided by problem-solving logic. We select demonstration examples by analyzing the problem-solving logic and order them based on curriculum learning. Specifically, we constructed a problem-solving logic instruction set based on the BREAK dataset and fine-tuned a language model to analyze the problem-solving logic of examples. Subsequently, we selected appropriate demonstration examples based on problem-solving logic and assessed their difficulty according to the number of problem-solving steps. In accordance with the principles of curriculum learning, we ordered the examples from easy to hard to serve as contextual prompts. Experimental results on multiple benchmarks indicate that our method outperforms previous ICL approaches in terms of performance and efficiency, effectively enhancing the complex reasoning capabilities of LLMs. Our project will be publicly available subsequently.

摘要

上下文学习（ICL）能显著增强大语言模型（LLMs）的复杂推理能力，其核心在于演示样例的选择与排序。现有方法通常依赖简单特征衡量样例间相关性，我们认为这些特征不足以反映样例间的内在联系。本研究提出一种基于问题解决逻辑的课程式ICL策略：通过分析问题解决逻辑选择演示样例，并依据课程学习原则进行排序。具体而言，我们基于BREAK数据集构建问题解决逻辑指令集，微调语言模型以解析样例的问题解决逻辑；随后根据逻辑匹配度筛选演示样例，并依据解题步骤数量评估难度。遵循课程学习原理，将样例按从易到难排序作为上下文提示。多基准测试表明，本方法在性能与效率上均优于现有ICL方案，能有效提升LLMs的复杂推理能力。项目代码后续将公开。
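
The easy-to-hard ordering step can be sketched as follows. This covers only the curriculum ordering, with the number of problem-solving steps as the difficulty proxy named in the abstract. The dictionary fields and the prompt layout are hypothetical, and the fine-tuned logic analyzer is not modeled.

```python
def order_by_difficulty(demos):
    """Sort demonstrations from easy to hard, using step count as the difficulty proxy."""
    return sorted(demos, key=lambda d: len(d["steps"]))


def build_prompt(demos, query):
    """Concatenate ordered demonstrations, then append the unanswered query."""
    parts = []
    for demo in order_by_difficulty(demos):
        steps = " ".join(demo["steps"])
        parts.append(f"Q: {demo['question']}\nSteps: {steps}\nA: {demo['answer']}")
    parts.append(f"Q: {query}\nA:")
    return "\n\n".join(parts)


# Hypothetical demonstrations that were already selected by logic similarity.
demos = [
    {"question": "hard", "steps": ["s1", "s2", "s3"], "answer": "c"},
    {"question": "easy", "steps": ["s1"], "answer": "a"},
    {"question": "medium", "steps": ["s1", "s2"], "answer": "b"},
]
ordered_questions = [d["question"] for d in order_by_difficulty(demos)]
prompt = build_prompt(demos, "new question")
```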

X-MAS: Towards Building Multi-Agent Systems with Heterogeneous LLMs

Abstract

arXiv:2505.16997v1 Announce Type: new Abstract: LLM-based multi-agent systems (MAS) extend the capabilities of single LLMs by enabling cooperation among multiple specialized agents. However, most existing MAS frameworks rely on a single LLM to drive all agents, constraining the system's intelligence to the limit of that model. This paper explores the paradigm of heterogeneous LLM-driven MAS (X-MAS), where agents are powered by diverse LLMs, elevating the system's potential to the collective intelligence of diverse LLMs. We introduce X-MAS-Bench, a comprehensive testbed designed to evaluate the performance of various LLMs across different domains and MAS-related functions. As an extensive empirical study, we assess 27 LLMs across 5 domains (encompassing 21 test sets) and 5 functions, conducting over 1.7 million evaluations to identify optimal model selections for each domain-function combination. Building on these findings, we demonstrate that transitioning from homogeneous to heterogeneous LLM-driven MAS can significantly enhance system performance without requiring structural redesign. Specifically, in a chatbot-only MAS scenario, the heterogeneous configuration yields up to 8.4% performance improvement on the MATH dataset. In a mixed chatbot-reasoner scenario, the heterogeneous MAS could achieve a remarkable 47% performance boost on the AIME dataset. Our results underscore the transformative potential of heterogeneous LLMs in MAS, highlighting a promising avenue for advancing scalable, collaborative AI systems.

摘要

基于大语言模型（LLM）的多智能体系统（MAS）通过多个专业化智能体的协作，扩展了单一LLM的能力。然而，现有大多数MAS框架依赖单一LLM驱动所有智能体，将系统智能限制在该模型的能力范围内。本文探索异构LLM驱动的多智能体系统（X-MAS）范式，其中智能体由多样化LLM驱动，将系统潜力提升至多样化LLM的集体智能水平。我们提出X-MAS-Bench——一个旨在评估不同领域及MAS相关功能中各类LLM性能的综合测试平台。作为一项大规模实证研究，我们在5个领域（涵盖21个测试集）和5种功能上评估了27个LLM，通过超过170万次测试确定各领域-功能组合的最优模型选择。基于这些发现，我们证明从同构转向异构LLM驱动的MAS可显著提升系统性能，而无需结构重构。具体而言，在纯聊天机器人MAS场景中，异构配置使MATH数据集上的性能提升达8.4%；在混合聊天机器人-推理器场景中，异构MAS可在AIME数据集上实现47%的显著性能提升。这些结果揭示了异构LLM在MAS中的变革潜力，为推进可扩展的协作式AI系统指明了一条前景广阔的路径。
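
Choosing a model for each domain-function combination amounts to taking an argmax over a score table. The sketch below assumes such a table is already available. The model names, the function labels, and the scores are invented placeholders, not results from X-MAS-Bench.

```python
def select_best_models(scores):
    """Map each (domain, function) pair to its highest-scoring model."""
    best = {}
    for (domain, function, model), score in scores.items():
        key = (domain, function)
        if key not in best or score > best[key][1]:
            best[key] = (model, score)
    return {key: model for key, (model, _) in best.items()}


# Hypothetical benchmark table: (domain, function, model) -> accuracy.
scores = {
    ("math", "answer", "model-a"): 0.61,
    ("math", "answer", "model-b"): 0.74,
    ("math", "revise", "model-a"): 0.58,
    ("math", "revise", "model-b"): 0.49,
}
assignment = select_best_models(scores)
```

In this toy table the two functions are assigned to different models, which is the heterogeneous configuration the abstract contrasts with a single-LLM system.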

AGENTIF: Benchmarking Instruction Following of Large Language Models in Agentic Scenarios

Abstract

arXiv:2505.16944v1 Announce Type: new Abstract: Large Language Models (LLMs) have demonstrated advanced capabilities in real-world agentic applications. Growing research efforts aim to develop LLM-based agents to address practical demands, introducing a new challenge: agentic scenarios often involve lengthy instructions with complex constraints, such as extended system prompts and detailed tool specifications. While adherence to such instructions is crucial for agentic applications, whether LLMs can reliably follow them remains underexplored. In this paper, we introduce AgentIF, the first benchmark for systematically evaluating LLM instruction following ability in agentic scenarios. AgentIF features three key characteristics: (1) Realistic, constructed from 50 real-world agentic applications. (2) Long, averaging 1,723 words with a maximum of 15,630 words. (3) Complex, averaging 11.9 constraints per instruction, covering diverse constraint types, such as tool specifications and condition constraints. To construct AgentIF, we collect 707 human-annotated instructions across 50 agentic tasks from industrial application agents and open-source agentic systems. For each instruction, we annotate the associated constraints and corresponding evaluation metrics, including code-based evaluation, LLM-based evaluation, and hybrid code-LLM evaluation. We use AgentIF to systematically evaluate existing advanced LLMs. We observe that current models generally perform poorly, especially in handling complex constraint structures and tool specifications. We further conduct error analysis and analytical experiments on instruction length and meta constraints, providing some findings about the failure modes of existing LLMs. We have released the code and data to facilitate future research.

摘要

大语言模型（LLMs）在现实世界的代理应用中已展现出先进能力。越来越多的研究致力于开发基于LLM的代理以满足实际需求，这带来了一项新挑战：代理场景通常涉及包含复杂约束的长篇指令，例如冗长的系统提示和详细的工具规范。尽管遵循此类指令对代理应用至关重要，但LLM是否能可靠地执行它们仍未得到充分探索。本文提出了AgentIF，这是首个系统评估LLM在代理场景中指令遵循能力的基准。AgentIF具有三个关键特征：（1）真实性，基于50个真实世界代理应用构建；（2）长度，平均1,723词，最长15,630词；（3）复杂性，每条指令平均包含11.9个约束，涵盖工具规范、条件约束等多样类型。为构建AgentIF，我们从工业应用代理和开源代理系统中收集了50个代理任务的707条人工标注指令，并为每条指令标注了相关约束及对应评估指标（包括基于代码的评估、基于LLM的评估以及混合代码-LLM评估）。通过AgentIF对现有先进LLM进行系统评估后，我们发现当前模型整体表现欠佳，尤其在处理复杂约束结构和工具规范时。我们进一步对指令长度和元约束进行了错误分析及实验研究，揭示了现有LLM的若干失效模式。相关代码和数据已开源以促进未来研究。

Transforming Decoder-Only Transformers for Accurate WiFi-Telemetry Based Indoor Localization

Abstract

arXiv:2505.15835v1 Announce Type: cross Abstract: Wireless Fidelity (WiFi) based indoor positioning is a widely researched area for determining the position of devices within a wireless network. Accurate indoor location has numerous applications, such as asset tracking and indoor navigation. Despite advances in WiFi localization techniques -- in particular approaches that leverage WiFi telemetry -- their adoption in practice remains limited due to several factors, including environmental changes that cause signal fading, multipath effects, and interference, which in turn degrade positioning accuracy. In addition, telemetry data differs depending on the WiFi device vendor, offering distinct features and formats; use case requirements can also vary widely. Currently, there is no unified model to handle all these variations effectively. In this paper, we present WiFiGPT, a Generative Pretrained Transformer (GPT) based system that is able to handle these variations while achieving high localization accuracy. Our experiments with WiFiGPT demonstrate that GPTs, in particular Large Language Models (LLMs), can effectively capture subtle spatial patterns in noisy wireless telemetry, making them reliable regressors. Compared to existing state-of-the-art methods, our method matches and often surpasses conventional approaches for multiple types of telemetry. Achieving sub-meter accuracy for RSSI and FTM and centimeter-level precision for CSI demonstrates the potential of LLM-based localization to outperform specialized techniques, all without handcrafted signal processing or calibration.

摘要

基于无线保真（WiFi）的室内定位是无线网络中设备位置确定领域的重要研究方向。精确的室内定位在资产追踪、室内导航等方面具有广泛应用。尽管WiFi定位技术（尤其是利用WiFi遥测的方法）取得了进展，但由于环境变化导致的信号衰减、多径效应、干扰等因素影响定位精度，其实际应用仍受限。此外，不同厂商的WiFi设备提供的遥测数据在特征和格式上存在差异，应用场景需求也各不相同。目前尚缺乏统一模型来有效处理这些变异性。本文提出WiFiGPT系统，该系统基于生成式预训练变换器（GPT），能够在处理这些变异性的同时实现高精度定位。实验表明，GPT（特别是大语言模型LLMs）能有效捕捉噪声无线遥测中的细微空间模式，成为可靠的回归器。与现有先进方法相比，我们的方法在多种遥测类型上达到或超越传统方案：针对RSSI和FTM实现亚米级精度，对CSI达到厘米级精度。这证明基于LLM的定位技术无需人工信号处理或校准即可超越专用技术。

UltraEdit: Training-, Subject-, and Memory-Free Lifelong Editing in Large Language Models

Abstract

arXiv:2505.14679v1 Announce Type: cross Abstract: Lifelong learning enables large language models (LLMs) to adapt to evolving information by continually updating their internal knowledge. An ideal system should support efficient, wide-ranging updates while preserving existing capabilities and ensuring reliable deployment. Model editing stands out as a promising solution for this goal, offering a focused and efficient way to revise a model's internal knowledge. Although recent paradigms have made notable progress, they often struggle to meet the demands of practical lifelong adaptation at scale. To bridge this gap, we propose ULTRAEDIT, a fundamentally new editing solution that is training-, subject-, and memory-free, making it particularly well-suited for ultra-scalable, real-world lifelong model editing. ULTRAEDIT performs editing through a self-contained process that relies solely on lightweight linear algebra operations to compute parameter shifts, enabling fast and consistent parameter modifications with minimal overhead. To improve scalability in lifelong settings, ULTRAEDIT employs a lifelong normalization strategy that continuously updates feature statistics across turns, allowing it to adapt to distributional shifts and maintain consistency over time. ULTRAEDIT achieves editing speeds over 7x faster than the previous state-of-the-art method (which was also the fastest known approach) while consuming less than 1/3 of the VRAM, making it the only method currently capable of editing a 7B LLM on a 24GB consumer-grade GPU. Furthermore, we construct ULTRAEDITBENCH, the largest dataset in the field to date with over 2M editing pairs, and demonstrate that our method supports up to 1M edits while maintaining high accuracy. Comprehensive experiments on four datasets and six models show that ULTRAEDIT consistently achieves superior performance across diverse model editing scenarios. Our code is available at: https://github.com/XiaojieGu/UltraEdit.

摘要

终身学习使大语言模型（LLMs）能够通过持续更新内部知识来适应不断变化的信息。理想的系统应支持高效、广泛的更新，同时保留现有能力并确保可靠部署。模型编辑作为实现这一目标的有前景方案脱颖而出，它提供了一种聚焦且高效的内部知识修订方式。尽管现有范式已取得显著进展，但往往难以满足大规模实际终身适应的需求。为填补这一空白，我们提出ULTRAEDIT——一种全新的编辑解决方案，其无需训练、不受主题限制且无需记忆存储，特别适合超大规模的现实世界终身模型编辑。ULTRAEDIT通过自包含流程执行编辑，仅依赖轻量级线性代数运算计算参数偏移，实现快速一致的参数修改且开销极小。为提升终身场景的可扩展性，ULTRAEDIT采用终身归一化策略持续更新跨轮次的特征统计量，使其能适应分布变化并保持长期一致性。ULTRAEDIT的编辑速度较先前最优方法（也是已知最快方法）提升7倍以上，同时VRAM消耗不足其1/3，成为目前唯一能在24GB消费级GPU上编辑70亿参数大模型的方法。此外，我们构建了该领域迄今最大数据集ULTRAEDITBENCH（含超200万编辑对），并证明本方法支持高达100万次编辑仍保持高精度。在四个数据集和六个模型上的全面实验表明，ULTRAEDIT在多样化模型编辑场景中均保持卓越性能。代码已开源：https://github.com/XiaojieGu/UltraEdit。

What Lives? A meta-analysis of diverse opinions on the definition of life

Abstract

arXiv:2505.15849v1 Announce Type: cross Abstract: The question of "what is life?" has challenged scientists and philosophers for centuries, producing an array of definitions that reflect both the mystery of its emergence and the diversity of disciplinary perspectives brought to bear on the question. Despite significant progress in our understanding of biological systems, psychology, computation, and information theory, no single definition for life has yet achieved universal acceptance. This challenge becomes increasingly urgent as advances in synthetic biology, artificial intelligence, and astrobiology challenge our traditional conceptions of what it means to be alive. We undertook a methodological approach that leverages large language models (LLMs) to analyze a set of definitions of life provided by a curated set of cross-disciplinary experts. We used a novel pairwise correlation analysis to map the definitions into distinct feature vectors, followed by agglomerative clustering, intra-cluster semantic analysis, and t-SNE projection to reveal underlying conceptual archetypes. This methodology revealed a continuous landscape of the themes relating to the definition of life, suggesting that what has historically been approached as a binary taxonomic problem should be instead conceived as differentiated perspectives within a unified conceptual latent space. We offer a new methodological bridge between reductionist and holistic approaches to fundamental questions in science and philosophy, demonstrating how computational semantic analysis can reveal conceptual patterns across disciplinary boundaries, and opening similar pathways for addressing other contested definitional territories across the sciences.

摘要

“生命是什么？”这一问题几个世纪以来一直挑战着科学家和哲学家，产生了诸多定义，既反映了生命涌现的奥秘，也体现了跨学科视角的多样性。尽管我们在理解生物系统、心理学、计算及信息论方面取得了重大进展，但尚未形成一个被普遍接受的生命定义。随着合成生物学、人工智能和天体生物学的进步不断挑战传统生命概念的边界，这一挑战变得愈发紧迫。我们采用了一种基于大语言模型（LLMs）的方法论，通过分析跨学科专家提供的生命定义集，运用新型成对相关性分析将定义映射为特征向量，继而进行凝聚聚类、簇内语义分析和t-SNE降维投影，以揭示潜在的概念原型。该方法展现出一个连续的生命定义主题图谱，表明这个历史上被视为二元分类学的问题，应被重新理解为统一概念潜在空间中的差异化视角。我们为科学与哲学基础问题的还原论与整体论方法搭建了新的方法论桥梁，证明计算语义分析如何揭示跨学科的概念模式，并为解决科学界其他存在争议的定义领域开辟了类似路径。

AutoData: A Multi-Agent System for Open Web Data Collection

Abstract

arXiv:2505.15859v1 Announce Type: cross Abstract: The exponential growth of data-driven systems and AI technologies has intensified the demand for high-quality web-sourced datasets. While existing datasets have proven valuable, conventional web data collection approaches face significant limitations in terms of human effort and scalability. Current data-collecting solutions fall into two categories: wrapper-based methods that struggle with adaptability and reproducibility, and large language model (LLM)-based approaches that incur substantial computational and financial costs. To address these challenges, we propose AutoData, a novel multi-agent system for Automated web Data collection, that requires minimal human intervention, i.e., only necessitating a natural language instruction specifying the desired dataset. In addition, AutoData is designed with a robust multi-agent architecture, featuring a novel oriented message hypergraph coordinated by a central task manager, to efficiently organize agents across research and development squads. Besides, we introduce a novel hypergraph cache system to advance the multi-agent collaboration process that enables efficient automated data collection and mitigates the token cost issues prevalent in existing LLM-based systems. Moreover, we introduce Instruct2DS, a new benchmark dataset supporting live data collection from web sources across three domains: academic, finance, and sports. Comprehensive evaluations over Instruct2DS and three existing benchmark datasets demonstrate AutoData's superior performance compared to baseline methods. Case studies on challenging tasks such as picture book collection and paper extraction from surveys further validate its applicability. Our source code and dataset are available at https://github.com/GraphResearcher/AutoData.

摘要

数据驱动系统和AI技术的指数级增长加剧了对高质量网络源数据集的需求。尽管现有数据集已证明其价值，但传统网络数据收集方法在人力投入和可扩展性方面存在显著局限。当前数据收集方案分为两类：基于包装器的方法难以适应变化且可复现性差，而基于大语言模型（LLM）的方法则需承担高昂的计算与财务成本。为应对这些挑战，我们提出AutoData——一种新型自动化网络数据收集多智能体系统，仅需自然语言指令指定目标数据集即可运行，极大减少了人工干预。该系统采用鲁棒的多智能体架构，通过中央任务管理器协调的新型定向消息超图，高效组织研发团队中的智能体。此外，我们引入超图缓存系统以优化多智能体协作流程，既能实现高效自动化数据收集，又能缓解现有基于LLM系统的令牌成本问题。同时，我们提出Instruct2DS基准数据集，支持从学术、金融和体育三大领域网络源进行实时数据采集。在Instruct2DS及三个现有基准数据集上的综合评估表明，AutoData性能显著优于基线方法。针对图画书收集和综述文献提取等挑战性任务的案例研究进一步验证了其适用性。源代码与数据集详见https://github.com/GraphResearcher/AutoData。

GRIT: Teaching MLLMs to Think with Images

Abstract

arXiv:2505.15879v1 Announce Type: cross Abstract: Recent studies have demonstrated the efficacy of using Reinforcement Learning (RL) in building reasoning models that articulate chains of thoughts prior to producing final answers. However, despite ongoing advances that aim at enabling reasoning for vision-language tasks, existing open-source visual reasoning models typically generate reasoning content with pure natural language, lacking explicit integration of visual information. This limits their ability to produce clearly articulated and visually grounded reasoning chains. To this end, we propose Grounded Reasoning with Images and Texts (GRIT), a novel method for training MLLMs to think with images. GRIT introduces a grounded reasoning paradigm, in which models generate reasoning chains that interleave natural language and explicit bounding box coordinates. These coordinates point to regions of the input image that the model consults during its reasoning process. Additionally, GRIT is equipped with a reinforcement learning approach, GRPO-GR, built upon the GRPO algorithm. GRPO-GR employs robust rewards focused on the final answer accuracy and format of the grounded reasoning output, which eliminates the need for data with reasoning chain annotations or explicit bounding box labels. As a result, GRIT achieves exceptional data efficiency, requiring as few as 20 image-question-answer triplets from existing datasets. Comprehensive evaluations demonstrate that GRIT effectively trains MLLMs to produce coherent and visually grounded reasoning chains, showing a successful unification of reasoning and grounding abilities.

摘要

近期研究表明，强化学习（RL）在构建推理模型方面具有显著效果，这类模型能在生成最终答案前明确表达思维链。然而，尽管当前研究不断推进视觉-语言任务的推理能力，现有开源视觉推理模型通常仅用纯自然语言生成推理内容，缺乏对视觉信息的显式整合。这导致其难以产生清晰表达且视觉可验证的推理链。为此，我们提出基于图像与文本的 grounded reasoning（GRIT）方法，通过新颖的训练方式使多模态大语言模型（MLLMs）实现图像化思考。GRIT 引入一种 grounded reasoning 范式，要求模型生成交替自然语言与显式边界框坐标的推理链，这些坐标指向模型推理过程中参考的输入图像区域。此外，GRIT 采用基于 GRPO 算法改进的强化学习方法 GRPO-GR，其奖励机制聚焦于最终答案准确性和 grounded reasoning 输出的格式规范，从而无需依赖带有推理链标注或显式边界框标签的数据。这使得 GRIT 具备卓越的数据效率，仅需从现有数据集中获取20个图像-问题-答案三元组即可完成训练。综合评估表明，GRIT 能有效训练 MLLMs 生成连贯且视觉可验证的推理链，成功实现了推理能力与 grounding 能力的统一。
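
A reasoning chain that interleaves text with bounding box coordinates can be checked with a simple parser. The sketch below assumes boxes are written as `[x1, y1, x2, y2]` with integer pixel values and that a format reward is 1.0 when at least one well-formed box appears. Both choices are assumptions for illustration, since the abstract does not define the exact output format or the GRPO-GR reward.

```python
import re

# Assumed box syntax: four non-negative integers in square brackets.
BOX_PATTERN = re.compile(r"\[\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*,\s*(\d+)\s*\]")


def extract_boxes(reasoning):
    """Return every (x1, y1, x2, y2) tuple found in the reasoning text."""
    return [tuple(int(v) for v in m.groups()) for m in BOX_PATTERN.finditer(reasoning)]


def format_reward(reasoning):
    """Hypothetical format reward: 1.0 if the chain has only well-formed boxes, and at least one."""
    boxes = extract_boxes(reasoning)
    well_formed = all(x1 < x2 and y1 < y2 for x1, y1, x2, y2 in boxes)
    return 1.0 if boxes and well_formed else 0.0


grounded = "The sign is at [10, 20, 110, 80], and it reads STOP."
ungrounded = "The sign reads STOP."
boxes = extract_boxes(grounded)
```

A reward of this kind needs no box annotations, only a check on the model's own output, which is consistent with the abstract's claim that labeled bounding boxes are not required.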

Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization

Abstract

arXiv:2505.15918v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated potential as factual knowledge bases; however, their capability to generate probabilistic knowledge about real-world events remains understudied. This paper investigates using probabilistic knowledge inherent in LLMs to derive probability estimates for statements concerning events and their interrelationships captured via a Bayesian Network (BN). Using LLMs in this context allows for the parameterization of BNs, enabling probabilistic modeling within specific domains. Experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results when compared to baselines, including random and uniform distributions, as well as approaches based on next-token generation probabilities. We explore how these LLM-derived distributions can serve as expert priors to refine distributions extracted from minimal data, significantly reducing systematic biases. Overall, this work introduces a promising strategy for automatically constructing Bayesian Networks by combining probabilistic knowledge extracted from LLMs with small amounts of real-world data. Additionally, we evaluate several prompting strategies for eliciting probabilistic knowledge from LLMs and establish the first comprehensive baseline for assessing LLM performance in extracting probabilistic knowledge.

摘要

大型语言模型（LLMs）已展现出作为事实性知识库的潜力，但其生成关于现实世界事件的概率性知识的能力仍待深入研究。本文探讨如何利用LLMs内在的概率知识，对通过贝叶斯网络（BN）捕获的事件及其相互关系进行概率估计。在此背景下使用LLMs可实现BN的参数化，从而支持特定领域的概率建模。在涵盖医疗保健至金融等领域的八十个公开贝叶斯网络上进行的实验表明，与随机均匀分布基线及基于下一词生成概率的方法相比，通过LLMs查询事件条件概率可获得有意义的结果。我们进一步探究如何将这些LLM导出的分布作为专家先验，以优化从少量数据中提取的分布，显著减少系统性偏差。总体而言，本研究提出了一种通过结合LLMs提取的概率知识与少量真实数据来自动构建贝叶斯网络的有效策略。此外，我们评估了多种用于从LLMs中提取概率知识的提示策略，并建立了首个评估LLMs在概率知识提取性能方面的综合基线。
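
One standard way to use an elicited distribution as an expert prior is a Dirichlet-multinomial update, sketched below for a single row of a conditional probability table. The abstract does not say that this exact estimator is the one used, so the function, the `prior_strength` pseudo-count, and the numbers are assumptions.

```python
def posterior_mean(llm_probs, counts, prior_strength=10.0):
    """Blend an LLM-elicited distribution with observed counts via a Dirichlet prior."""
    # The prior contributes prior_strength pseudo-observations split according to llm_probs.
    total = sum(counts) + prior_strength
    return [(c + prior_strength * p) / total for p, c in zip(llm_probs, counts)]


# Hypothetical CPT row: the LLM says P(state) = [0.8, 0.2]; four real samples disagree.
llm_probs = [0.8, 0.2]
counts = [1, 3]
blended = posterior_mean(llm_probs, counts)
data_only = [c / sum(counts) for c in counts]
```

With only four samples the data-only estimate is `[0.25, 0.75]`, while the blended estimate stays closer to the prior. As more data arrives the counts dominate and the influence of the prior shrinks.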

Causal Interventions Reveal Shared Structure Across English Filler-Gap Constructions

Abstract

arXiv:2505.16002v1 Announce Type: cross Abstract: Large Language Models (LLMs) have emerged as powerful sources of evidence for linguists seeking to develop theories of syntax. In this paper, we argue that causal interpretability methods, applied to LLMs, can greatly enhance the value of such evidence by helping us characterize the abstract mechanisms that LLMs learn to use. Our empirical focus is a set of English filler-gap dependency constructions (e.g., questions, relative clauses). Linguistic theories largely agree that these constructions share many properties. Using experiments based on Distributed Interchange Interventions, we show that LLMs converge on similar abstract analyses of these constructions. These analyses also reveal previously overlooked factors -- relating to frequency, filler type, and surrounding context -- that could motivate changes to standard linguistic theory. Overall, these results suggest that mechanistic, internal analyses of LLMs can push linguistic theory forward.

摘要

大型语言模型（LLMs）已成为语言学家构建句法理论时的重要证据来源。本文提出，通过对LLMs应用因果可解释性方法，能够通过揭示模型学习的抽象机制显著提升此类证据的价值。我们以英语填充语-空缺依存结构（如疑问句、关系从句）为实证研究对象。语言学理论普遍认为这些结构具有诸多共性。基于分布式互换干预的实验表明，LLMs对这些结构形成了相似的抽象分析。这些分析同时揭示了频率、填充语类型及上下文环境等被传统理论忽视的影响因素，可能推动标准语言学理论的修正。总体而言，研究结果证明对LLMs进行机制性内部分析能够促进语言学理论的发展。

Pre-training Large Memory Language Models with Internal and External Knowledge

Abstract

arXiv:2505.15962v1 Announce Type: cross Abstract: Neural language models are black-boxes -- both linguistic patterns and factual knowledge are distributed across billions of opaque parameters. This entangled encoding makes it difficult to reliably inspect, verify, or update specific facts. We propose a new class of language models, Large Memory Language Models (LMLM) with a pre-training recipe that stores factual knowledge in both internal weights and an external database. Our approach strategically masks externally retrieved factual values from the training loss, thereby teaching the model to perform targeted lookups rather than relying on memorization in model weights. Our experiments demonstrate that LMLMs achieve competitive performance compared to significantly larger, knowledge-dense LLMs on standard benchmarks, while offering the advantages of explicit, editable, and verifiable knowledge bases. This work represents a fundamental shift in how language models interact with and manage factual knowledge.

摘要

神经语言模型是黑箱系统——无论是语言模式还是事实知识，都分布在数十亿个不透明的参数中。这种纠缠的编码方式使得可靠地检查、验证或更新特定事实变得困难。我们提出了一类新型语言模型，即具有大记忆的语言模型（LMLM），其预训练方案将事实知识同时存储在内部权重和外部数据库中。我们的方法策略性地屏蔽了训练损失中从外部检索到的事实值，从而教导模型执行定向查询，而非依赖模型权重的记忆。实验表明，与规模更大、知识密集的大型语言模型（LLM）相比，LMLM在标准基准测试中实现了具有竞争力的性能，同时提供了显式、可编辑和可验证的知识库优势。这项工作代表了语言模型与事实知识交互和管理方式的根本性转变。
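
Masking retrieved values out of the training loss can be sketched as a per-token mask on the negative log-likelihood. The example uses plain Python and made-up probabilities. The function name and the boolean mask layout are assumptions, not the LMLM training code.

```python
import math


def masked_nll(token_probs, is_retrieved):
    """Average negative log-likelihood over tokens that did NOT come from external lookup."""
    kept = [-math.log(p) for p, masked in zip(token_probs, is_retrieved) if not masked]
    return sum(kept) / len(kept) if kept else 0.0


# Hypothetical sequence: the middle token is a fact value returned by the database.
token_probs = [0.5, 0.01, 0.5]
is_retrieved = [False, True, False]

loss_masked = masked_nll(token_probs, is_retrieved)
loss_unmasked = masked_nll(token_probs, [False, False, False])
```

The poorly predicted fact token dominates the unmasked loss but contributes nothing to the masked one, so the model is not pushed to memorize that value in its weights.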

VERDI: VLM-Embedded Reasoning for Autonomous Driving

Abstract

arXiv:2505.15925v1 Announce Type: cross Abstract: While autonomous driving (AD) stacks struggle with decision making under partial observability and real-world complexity, human drivers are capable of commonsense reasoning to make near-optimal decisions with limited information. Recent work has attempted to leverage finetuned Vision-Language Models (VLMs) for trajectory planning at inference time to emulate human behavior. Despite their success in benchmark evaluations, these methods are often impractical to deploy (a 70B-parameter VLM running inference at merely 8 tokens per second requires more than 160 GB of memory), and their monolithic network structure prohibits safety decomposition. To bridge this gap, we propose VLM-Embedded Reasoning for autonomous Driving (VERDI), a training-time framework that distills the reasoning process and commonsense knowledge of VLMs into the AD stack. VERDI augments modular differentiable end-to-end (e2e) AD models by aligning intermediate module outputs at the perception, prediction, and planning stages with text features explaining the driving reasoning process produced by VLMs. By encouraging alignment in latent space, VERDI enables the modular AD stack to internalize structured reasoning, without incurring the inference-time costs of large VLMs. We demonstrate the effectiveness of our method on the NuScenes dataset and find that VERDI outperforms existing e2e methods that do not embed reasoning by 10% in $\ell_{2}$ distance, while maintaining high inference speed.

摘要

尽管自动驾驶（AD）系统在部分可观测性和现实世界复杂性下的决策面临挑战，人类驾驶员却能够通过常识推理在有限信息下做出近乎最优的决策。近期研究尝试利用微调的视觉语言模型（VLMs）在推理阶段进行轨迹规划以模拟人类行为。尽管这些方法在基准评估中取得了成功，但其部署往往不切实际（一个700亿参数的VLM以每秒仅8个令牌的推理速度需要超过160G内存），且其整体式网络结构阻碍了安全性分解。为弥合这一差距，我们提出用于自动驾驶的VLM嵌入式推理框架（VERDI），该训练时框架将VLMs的推理过程和常识知识蒸馏至AD系统中。VERDI通过将感知、预测和规划阶段的中间模块输出与VLMs生成的驾驶推理过程文本特征对齐，增强了模块化可微分端到端（e2e）AD模型。通过在潜在空间实现对齐，VERDI使模块化AD系统能够内化结构化推理，而无需承担大型VLMs的推理时成本。我们在NuScenes数据集上验证了方法的有效性，发现VERDI在ℓ2距离上优于未嵌入推理的现有e2e方法10%，同时保持较高的推理速度。

Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey

Abstract

arXiv:2505.15957v1 Announce Type: cross Abstract: With advancements in large audio-language models (LALMs), which enhance large language models (LLMs) with auditory capabilities, these models are expected to demonstrate universal proficiency across various auditory tasks. While numerous benchmarks have emerged to assess LALMs' performance, they remain fragmented and lack a structured taxonomy. To bridge this gap, we conduct a comprehensive survey and propose a systematic taxonomy for LALM evaluations, categorizing them into four dimensions based on their objectives: (1) General Auditory Awareness and Processing, (2) Knowledge and Reasoning, (3) Dialogue-oriented Ability, and (4) Fairness, Safety, and Trustworthiness. We provide detailed overviews within each category and highlight challenges in this field, offering insights into promising future directions. To the best of our knowledge, this is the first survey specifically focused on the evaluations of LALMs, providing clear guidelines for the community. We will release the collection of the surveyed papers and actively maintain it to support ongoing advancements in the field.

摘要

随着大型音频-语言模型（LALMs）的发展——这类模型通过增强大型语言模型（LLMs）的听觉能力，被期望在各种听觉任务中展现出普适性能力。尽管已有大量基准测试涌现以评估LALMs的性能，但这些评估仍处于碎片化状态且缺乏系统化的分类体系。为填补这一空白，我们开展了全面调研并提出了一套LALM评估的系统分类法，根据其目标将评估划分为四个维度：（1）通用听觉感知与处理能力，（2）知识与推理能力，（3）对话导向能力，以及（4）公平性、安全性与可信度。我们对每个类别进行了详细概述，并指出了该领域面临的挑战，为未来研究方向提供了前瞻性见解。据我们所知，这是首个专门针对LALMs评估的调研工作，为学界提供了清晰的指导框架。我们将公开所调研文献的汇总集合并持续维护，以支持该领域的持续发展。

Interpretability Illusions with Sparse Autoencoders: Evaluating Robustness of Concept Representations

Abstract

arXiv:2505.16004v1 Announce Type: cross Abstract: Sparse autoencoders (SAEs) are commonly used to interpret the internal activations of large language models (LLMs) by mapping them to human-interpretable concept representations. While existing evaluations of SAEs focus on metrics such as the reconstruction-sparsity tradeoff, human (auto-)interpretability, and feature disentanglement, they overlook a critical aspect: the robustness of concept representations to input perturbations. We argue that robustness must be a fundamental consideration for concept representations, reflecting the fidelity of concept labeling. To this end, we formulate robustness quantification as input-space optimization problems and develop a comprehensive evaluation framework featuring realistic scenarios in which adversarial perturbations are crafted to manipulate SAE representations. Empirically, we find that tiny adversarial input perturbations can effectively manipulate concept-based interpretations in most scenarios without notably affecting the outputs of the base LLMs themselves. Overall, our results suggest that SAE concept representations are fragile and may be ill-suited for applications in model monitoring and oversight.

摘要

稀疏自编码器（SAEs）通常用于通过将大型语言模型（LLMs）的内部激活映射到人类可解释的概念表示来解析其工作机制。现有对SAEs的评估主要关注重建-稀疏性权衡、人类（自动）可解释性及特征解耦等指标，却忽视了一个关键维度：概念表示对输入扰动的鲁棒性。我们认为鲁棒性必须作为概念表示的基本考量，因其反映了概念标注的保真度。为此，我们将鲁棒性量化问题建模为输入空间优化问题，并开发了一个包含现实场景的综合评估框架——这些场景中生成的对抗性扰动可操纵SAE的表示。实证研究表明，在大多数情况下，微小的对抗性输入扰动即可有效操纵基于概念的解释，而不会显著影响底层LLM的输出。总体而言，我们的结果表明SAE的概念表示具有脆弱性，可能不适合应用于模型监控与监督场景。

SLMEval: Entropy-Based Calibration for Human-Aligned Evaluation of Large Language Models

Abstract

arXiv:2505.16003v1 Announce Type: cross Abstract: The LLM-as-a-Judge paradigm offers a scalable, reference-free approach for evaluating language models. Although several calibration techniques have been proposed to better align these evaluators with human judgment, prior studies focus primarily on narrow, well-structured benchmarks. As a result, it remains unclear whether such calibrations generalize to real-world, open-ended tasks. In this work, we show that SOTA calibrated evaluators often fail in these settings, exhibiting weak or even negative correlation with human judgments. To address this, we propose SLMEval, a novel and efficient calibration method based on entropy maximization over a small amount of human preference data. By estimating a latent distribution over model quality and reweighting evaluator scores accordingly, SLMEval achieves strong correlation with human evaluations across two real-world production use cases and a public benchmark. For example, on one such task, SLMEval achieves a Spearman correlation of 0.57 with human judgments, while G-Eval yields a negative correlation. In addition, SLMEval reduces evaluation costs by 5-30x compared to GPT-4-based calibrated evaluators such as G-Eval.

摘要

LLM-as-a-Judge范式为评估语言模型提供了一种可扩展、无参考的解决方案。尽管已有多种校准技术被提出以更好地使这些评估者与人类判断保持一致，但先前研究主要集中于狭窄、结构化的基准测试。因此，这类校准是否适用于现实世界中开放式的任务仍不明确。本研究显示，当前最先进的校准评估器在此类场景中往往失效，与人类判断呈现弱相关甚至负相关。为此，我们提出SLMEval——一种基于熵最大化的新型高效校准方法，仅需少量人类偏好数据。通过估计模型质量的潜在分布并相应调整评估分数权重，SLMEval在两个现实生产用例和公共基准测试中均实现了与人类评估的强相关性。例如，在某项任务中，SLMEval获得0.57的斯皮尔曼相关系数，而G-Eval则呈现负相关。此外，相较于基于GPT-4的校准评估器（如G-eval），SLMEval将评估成本降低了5至30倍。
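The agreement figures quoted above are Spearman correlations between evaluator scores and human judgments. As a minimal sketch of that metric only (not of SLMEval's calibration, which the abstract does not specify), the function below ranks both score lists, averaging ranks over ties, and takes the Pearson correlation of the ranks. The names `average_ranks` and `spearman` are illustrative.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # average of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for constant input")
    return sxy / sqrt(sxx * syy)
```

A value near 1 means the evaluator orders outputs the way humans do; a negative value means it tends to reverse the human ordering.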

Pixel Reasoner: Incentivizing Pixel-Space Reasoning with Curiosity-Driven Reinforcement Learning

Abstract

arXiv:2505.15966v1 Announce Type: cross Abstract: Chain-of-thought reasoning has significantly improved the performance of Large Language Models (LLMs) across various domains. However, this reasoning process has been confined exclusively to textual space, limiting its effectiveness in visually intensive tasks. To address this limitation, we introduce the concept of reasoning in the pixel-space. Within this novel framework, Vision-Language Models (VLMs) are equipped with a suite of visual reasoning operations, such as zoom-in and select-frame. These operations enable VLMs to directly inspect, interrogate, and infer from visual evidence, thereby enhancing reasoning fidelity for visual tasks. Cultivating such pixel-space reasoning capabilities in VLMs presents notable challenges, including the model's initially imbalanced competence and its reluctance to adopt the newly introduced pixel-space operations. We address these challenges through a two-phase training approach. The first phase employs instruction tuning on synthesized reasoning traces to familiarize the model with the novel visual operations. Following this, a reinforcement learning (RL) phase leverages a curiosity-driven reward scheme to balance exploration between pixel-space reasoning and textual reasoning. With these visual operations, VLMs can interact with complex visual inputs, such as information-rich images or videos, to proactively gather necessary information. We demonstrate that this approach significantly improves VLM performance across diverse visual reasoning benchmarks. Our 7B model, Pixel Reasoner, achieves 84% on V* bench, 74% on TallyQA-Complex, and 84% on InfographicsVQA, marking the highest accuracy achieved by any open-source model to date. These results highlight the importance of pixel-space reasoning and the effectiveness of our framework.

摘要

思维链推理显著提升了大型语言模型（LLMs）在多个领域的性能表现。然而，该推理过程此前仅局限于文本空间，这限制了其在视觉密集型任务中的有效性。为解决这一局限，我们提出了像素空间推理的新概念。在此创新框架下，视觉语言模型（VLMs）被赋予一系列视觉推理操作（如局部放大和帧选择），使其能够直接对视觉证据进行检视、质询与推断，从而提升视觉任务的推理保真度。培养VLMs的像素空间推理能力面临两大挑战：模型初始能力的不均衡性及对新引入像素空间操作的抵触。我们通过两阶段训练方法应对这些挑战：第一阶段采用合成推理轨迹的指令微调，使模型熟悉新型视觉操作；随后通过强化学习（RL）阶段，利用好奇心驱动的奖励机制平衡像素空间推理与文本推理的探索。借助这些视觉操作，VLMs能够与信息密集的图像或视频等复杂视觉输入进行交互，主动收集必要信息。实验表明，该方法在多种视觉推理基准测试中显著提升了VLM性能。我们的7B参数模型在V* bench上达到84%准确率，TallyQA-Complex达74%，InfographicsVQA达84%，创下当前开源模型的最高精度记录。这些结果印证了像素空间推理的重要性及本框架的有效性。

NOVER: Incentive Training for Language Models via Verifier-Free Reinforcement Learning

Abstract

arXiv:2505.16022v1 Announce Type: cross Abstract: Recent advances such as DeepSeek R1-Zero highlight the effectiveness of incentive training, a reinforcement learning paradigm that computes rewards solely based on the final answer part of a language model's output, thereby encouraging the generation of intermediate reasoning steps. However, these methods fundamentally rely on external verifiers, which limits their applicability to domains like mathematics and coding where such verifiers are readily available. Although reward models can serve as verifiers, they require high-quality annotated data and are costly to train. In this work, we propose NOVER, NO-VERifier Reinforcement Learning, a general reinforcement learning framework that requires only standard supervised fine-tuning data with no need for an external verifier. NOVER enables incentive training across a wide range of text-to-text tasks and outperforms the model of the same size distilled from large reasoning models such as DeepSeek R1 671B by 7.7 percent. Moreover, the flexibility of NOVER enables new possibilities for optimizing large language models, such as inverse incentive training.

摘要

DeepSeek R1-Zero等最新进展凸显了激励训练的有效性，这是一种强化学习范式，其奖励仅基于语言模型输出的最终答案部分进行计算，从而鼓励生成中间推理步骤。然而，这些方法从根本上依赖于外部验证器，因而其适用范围局限于数学和编程等验证器易于获取的领域。尽管奖励模型可作为验证器，但它们需要高质量标注数据且训练成本高昂。本研究提出NOVER（无验证器强化学习），这是一种通用强化学习框架，仅需标准监督微调数据而无需外部验证器。NOVER能够在广泛的文本到文本任务中实现激励训练，其性能比从DeepSeek R1 671B等大型推理模型蒸馏出的同规模模型高出7.7%。此外，NOVER的灵活性为优化大语言模型提供了新可能性，例如逆向激励训练。

Merge to Mix: Mixing Datasets via Model Merging

Abstract

arXiv:2505.16066v1 Announce Type: cross Abstract: Mixing datasets for fine-tuning large models (LMs) has become critical for maximizing performance on downstream tasks. However, composing effective dataset mixtures typically relies on heuristics and trial-and-error, often requiring multiple fine-tuning runs to achieve the desired outcome. We propose a novel method, Merge to Mix, that accelerates composing dataset mixtures through model merging. Model merging is a recent technique that combines the abilities of multiple individually fine-tuned LMs into a single LM by using a few simple arithmetic operations. Our key insight is that merging models individually fine-tuned on each dataset in a mixture can effectively serve as a surrogate for a model fine-tuned on the entire mixture. Merge to Mix leverages this insight to accelerate selecting dataset mixtures without requiring full fine-tuning on each candidate mixture. Our experiments demonstrate that Merge to Mix surpasses state-of-the-art methods in dataset selection for fine-tuning LMs.

摘要

混合数据集以微调大模型（LMs）已成为提升下游任务性能的关键方法。然而，构建有效的数据集混合通常依赖于启发式方法和试错过程，往往需要多次微调才能达到预期效果。我们提出了一种新方法“合并混合法”（Merge to Mix），通过模型合并加速数据集混合的构建。模型合并是一种新兴技术，通过简单的算术运算将多个单独微调的LMs能力整合到单一模型中。我们的核心发现是：对混合数据集中每个数据集单独微调的模型进行合并，可有效替代对整个混合数据集微调的模型。合并混合法利用这一发现加速数据集选择，无需对每个候选混合进行完整微调。实验表明，在微调LMs的数据集选择任务中，合并混合法优于当前最先进方法。
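The key idea, using a merged model as a cheap surrogate for a model fine-tuned on the whole mixture, can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: each fine-tuned model is a flat list of floats, merging is a uniform element-wise average (the abstract only says "a few simple arithmetic operations"), and `evaluate` is a hypothetical caller-supplied scoring function where higher is better.

```python
from itertools import combinations


def merge(models):
    """Uniformly average parameter vectors element-wise (assumed merge rule)."""
    n = len(models)
    return [sum(weights) / n for weights in zip(*models)]


def select_mixture(finetuned, evaluate, size):
    """Score every dataset subset of the given size via its merged surrogate.

    finetuned: dict mapping dataset name -> parameters of the model
               fine-tuned on that dataset alone (one fine-tuning run each).
    evaluate:  hypothetical callable scoring a parameter vector on the
               downstream task; higher is better.
    Returns (best_subset, best_score).
    """
    best = None
    for subset in combinations(sorted(finetuned), size):
        surrogate = merge([finetuned[name] for name in subset])
        score = evaluate(surrogate)
        if best is None or score > best[1]:
            best = (subset, score)
    return best
```

Under these assumptions the cost is one fine-tuning run per dataset plus one cheap merge and evaluation per candidate mixture, rather than one fine-tuning run per candidate mixture.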

Not All Models Suit Expert Offloading: On Local Routing Consistency of Mixture-of-Expert Models

Abstract

arXiv:2505.16056v1 Announce Type: cross Abstract: Mixture-of-Experts (MoE) enables efficient scaling of large language models (LLMs) with sparsely activated experts during inference. To effectively deploy large MoE models on memory-constrained devices, many systems introduce expert offloading that caches a subset of experts in fast memory, leaving others on slow memory to run on CPU or load on demand. While some research has exploited the locality of expert activations, where consecutive tokens activate similar experts, the degree of this local routing consistency varies across models and remains understudied. In this paper, we propose two metrics to measure local routing consistency of MoE models: (1) Segment Routing Best Performance (SRP), which evaluates how well a fixed group of experts can cover the needs of a segment of tokens, and (2) Segment Cache Best Hit Rate (SCH), which measures the optimal segment-level cache hit rate under a given cache size limit. We analyzed 20 MoE LLMs with diverse sizes and architectures and found that models that apply MoE on every layer and do not use shared experts exhibit the highest local routing consistency. We further showed that domain-specialized experts contribute more to routing consistency than vocabulary-specialized ones, and that most models can balance between cache effectiveness and efficiency with cache sizes approximately 2x the active experts. These findings pave the way for memory-efficient MoE design and deployment without compromising inference speed. We publish the code for replicating experiments at https://github.com/ljcleo/moe-lrc.

摘要

混合专家（MoE）技术通过推理过程中稀疏激活专家模块，实现了大语言模型（LLM）的高效扩展。为在内存受限设备上有效部署大型MoE模型，现有系统多采用专家卸载策略——将部分专家缓存于高速内存，其余专家保留在低速内存中通过CPU运行或按需加载。尽管已有研究利用专家激活的局部性（连续token倾向于激活相似专家），但这种局部路由一致性的程度因模型而异且研究不足。本文提出两项指标量化MoE模型的局部路由一致性：(1) 分段路由最优性能（SRP），评估固定专家组覆盖token片段需求的能力；(2) 分段缓存最优命中率（SCH），衡量给定缓存容量限制下的最优分段级缓存命中率。通过对20个不同规模与架构的MoE LLM进行分析，我们发现每层均应用MoE且未使用共享专家的模型表现出最高的局部路由一致性。进一步研究表明：领域专用专家对路由一致性的贡献大于词汇专用专家，且多数模型在缓存容量约为激活专家数2倍时可平衡缓存效率与效果。这些发现为不影响推理速度的内存高效MoE设计与部署提供了理论基础。实验复现代码发布于https://github.com/ljcleo/moe-lrc。
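To make the segment-level cache metric concrete, here is a minimal sketch of one plausible reading of "optimal segment-level cache hit rate under a given cache size limit": within a segment, keep the `cache_size` most frequently activated experts fixed in fast memory and report the fraction of activations they serve. The exact definition in the paper may differ; the function name and input layout (one list of activated expert IDs per token) are assumptions.

```python
from collections import Counter


def segment_cache_best_hit_rate(segment, cache_size):
    """Best hit rate of a fixed expert cache over one token segment.

    segment:    list of per-token lists of activated expert IDs.
    cache_size: number of experts that fit in fast memory.
    """
    # Count how often each expert is activated anywhere in the segment.
    counts = Counter(e for token_experts in segment for e in token_experts)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # A fixed cache is optimal when it holds the most-activated experts.
    hits = sum(count for _, count in counts.most_common(cache_size))
    return hits / total


# Four tokens, two active experts each; experts 0 and 1 dominate.
example = [[0, 1], [0, 2], [0, 1], [3, 1]]
rate = segment_cache_best_hit_rate(example, cache_size=2)  # 6 of 8 activations
```

A model whose consecutive tokens reuse the same experts scores close to 1 even with a small cache, which is the property that makes expert offloading practical.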

Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

Abstract

arXiv:2505.16088v1 Announce Type: cross Abstract: Modern BPE tokenizers often split calendar dates into meaningless fragments, e.g., 20250312 → 202, 503, 12, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokenizer preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future regimes; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates like historical and futuristic dates. Further, we find that the larger the model, the faster the emergent date abstraction that heals date fragments is accomplished. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, typically differing from human interpretation (year → month → day).

摘要

现代BPE分词器常将日期分割为无意义的片段（如20250312→202、503、12），导致标记数量膨胀并破坏稳健时间推理所需的内在结构。本研究提出：（1）一种简单可解释的度量指标——日期碎片化比率，用于评估分词器保留多位数日期成分的保真度；（2）发布DateAugBench基准测试集，包含6500个样本，涵盖基于上下文的日期解析、格式无关难题及跨越历史/当代/未来时期的日期运算三大时序推理任务；（3）通过分层探测与因果注意力跳分析，揭示大语言模型通过拼接年月日碎片进行时序推理的涌现式日期抽象机制。实验表明，过度碎片化会导致历史/未来等非常见日期上的准确率下降达10个百分点。模型规模越大，其修复日期碎片的涌现抽象能力形成越快。研究还发现大模型组装日期碎片的推理路径（年→月→日）通常与人类理解方式存在差异。
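The abstract does not give the formula for the date fragmentation ratio, so the following is only an illustrative stand-in: it reports the share of date components (year, month, day) that do not survive tokenization as a single token of their own. The function names and the span-based representation are assumptions made for this sketch.

```python
def token_spans(tokens):
    """Map consecutive tokens to (start, end) character offsets."""
    spans, pos = [], 0
    for tok in tokens:
        spans.append((pos, pos + len(tok)))
        pos += len(tok)
    return spans


def date_fragmentation_ratio(tokens, components):
    """Fraction of date components not preserved as exactly one token.

    tokens:     tokenizer output for the date string, in order.
    components: (start, end) character spans of year, month and day.
    0.0 means every component is intact; 1.0 means none is.
    """
    spans = set(token_spans(tokens))
    intact = sum(1 for comp in components if comp in spans)
    return 1 - intact / len(components)


# Spans for the YYYYMMDD layout: year, month, day.
YYYYMMDD = [(0, 4), (4, 6), (6, 8)]

# The split from the abstract: 20250312 -> 202, 503, 12 (only the day survives).
ratio = date_fragmentation_ratio(["202", "503", "12"], YYYYMMDD)
```

With the split quoted in the abstract, two of the three components are broken, so this stand-in metric gives roughly 0.67, whereas a tokenizer producing 2025, 03, 12 would score 0.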

Steering LVLMs via Sparse Autoencoder for Hallucination Mitigation

Abstract

arXiv:2505.16146v1 Announce Type: cross Abstract: Large vision-language models (LVLMs) have achieved remarkable performance on multimodal tasks such as visual question answering (VQA) and image captioning. However, they still suffer from hallucinations, generating text inconsistent with visual input, posing significant risks in real-world applications. Existing approaches to address this issue focus on incorporating external knowledge bases, alignment training, or decoding strategies, all of which require substantial computational cost and time. Recent works try to explore more efficient alternatives by adjusting LVLMs' internal representations. Although promising, these methods may cause hallucinations to be insufficiently suppressed or lead to excessive interventions that negatively affect normal semantics. In this work, we leverage sparse autoencoders (SAEs) to identify semantic directions closely associated with either hallucinations or actuality, realizing more precise and direct hallucination-related representations. Our analysis demonstrates that interventions along the faithful direction we identified can mitigate hallucinations, while those along the hallucinatory direction can exacerbate them. Building on these insights, we propose Steering LVLMs via SAE Latent Directions (SSL), a training-free method based on SAE-derived latent directions to mitigate hallucinations in LVLMs. Extensive experiments demonstrate that SSL significantly outperforms existing decoding approaches in mitigating hallucinations, while maintaining transferability across different model architectures with negligible additional time overhead.

摘要

大型视觉语言模型（LVLMs）在多模态任务（如视觉问答和图像描述生成）中展现出卓越性能，但仍存在幻觉问题——生成与视觉输入不一致的文本，这对实际应用构成重大风险。现有解决方法主要依赖外部知识库整合、对齐训练或解码策略，这些方法均需高昂计算成本和时间消耗。近期研究尝试通过调整LVLMs内部表征来探索更高效的替代方案，虽然前景可观，但这些方法可能导致幻觉抑制不足或产生过度干预，进而损害正常语义表达。本研究利用稀疏自编码器（SAEs）识别与幻觉或真实性紧密关联的语义方向，实现更精准直接的幻觉相关表征定位。分析表明，沿我们识别的可信方向进行干预可缓解幻觉，而沿幻觉方向干预则会加剧该现象。基于此，我们提出SAE潜在方向引导法（SSL），这是一种基于SAE派生潜在方向的免训练方法，用于抑制LVLMs的幻觉生成。大量实验证明，SSL在减轻幻觉方面显著优于现有解码方法，同时保持跨模型架构的可迁移性，且附加时间开销可忽略不计。

QuickVideo: Real-Time Long Video Understanding with System Algorithm Co-Design

Abstract

arXiv:2505.16175v1 Announce Type: cross Abstract: Long-video understanding has emerged as a crucial capability in real-world applications such as video surveillance, meeting summarization, educational lecture analysis, and sports broadcasting. However, it remains computationally prohibitive for VideoLLMs, primarily due to two bottlenecks: 1) sequential video decoding, in which converting the raw bit stream to RGB frames can take up to a minute for hour-long video inputs, and 2) costly prefilling of up to several million tokens for LLM inference, resulting in high latency and memory use. To address these challenges, we propose QuickVideo, a system-algorithm co-design that substantially accelerates long-video understanding to support real-time downstream applications. It comprises three key innovations: QuickDecoder, a parallelized CPU-based video decoder that achieves 2-3 times speedup by splitting videos into keyframe-aligned intervals processed concurrently; QuickPrefill, a memory-efficient prefilling method using KV-cache pruning to support more frames with less GPU memory; and an overlapping scheme that overlaps CPU video decoding with GPU inference. Together, these components reduce inference time by a minute on long video inputs, enabling scalable, high-quality video understanding even on limited hardware. Experiments show that QuickVideo generalizes across durations and sampling rates, making long video processing feasible in practice.

摘要

长视频理解在视频监控、会议摘要、教学讲座分析和体育赛事转播等实际应用中已成为关键能力。然而，由于两大瓶颈问题，当前视频大语言模型仍面临巨大计算负担：1）顺序视频解码——将原始比特流转换为RGB帧的过程对小时级视频输入可能耗时长达一分钟；2）高达数百万token的昂贵预填充导致LLM推理延迟高且内存占用大。为解决这些挑战，我们提出QuickVideo系统-算法协同设计方案，通过三大核心创新显著加速长视频理解以支持实时下游应用：QuickDecoder采用基于CPU的并行视频解码器，通过将视频分割为关键帧对齐区间并发处理，实现2-3倍加速；QuickPrefill运用KV缓存剪枝的内存高效预填充方法，以更少GPU内存支持更多帧处理；以及重叠调度方案实现CPU视频解码与GPU推理的并行执行。这些组件共同将长视频输入的推理时间缩短一分钟，使有限硬件条件下仍可进行高质量、可扩展的视频理解。实验表明QuickVideo能适应不同时长和采样率，使长视频处理在实践中具备可行性。
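The two system-level ideas, decoding keyframe-aligned intervals concurrently and overlapping decoding with inference, can be sketched with a bounded producer-consumer queue. This is a structural illustration only: `decode_interval` is a hypothetical stand-in that fabricates frame labels instead of decoding video, `consume` stands in for GPU prefill, and threads are used purely to show the overlap pattern.

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor


def decode_interval(interval):
    """Hypothetical stand-in for decoding one keyframe-aligned interval."""
    start, end = interval
    return [f"frame{i}" for i in range(start, end)]


def run_pipeline(intervals, consume, workers=4, buffer_size=2):
    """Decode intervals in parallel while the consumer processes earlier ones.

    consume: callable applied to each decoded interval, in order.
    Returns the list of consumer results.
    """
    decoded = queue.Queue(maxsize=buffer_size)  # bounded to cap memory use

    def producer():
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # pool.map yields results in input order while work runs concurrently.
            for frames in pool.map(decode_interval, intervals):
                decoded.put(frames)
        decoded.put(None)  # sentinel: no more intervals

    thread = threading.Thread(target=producer)
    thread.start()
    results = []
    while (frames := decoded.get()) is not None:
        results.append(consume(frames))  # overlaps with ongoing decoding
    thread.join()
    return results
```

For example, `run_pipeline([(0, 2), (2, 5)], len)` counts the frames of each interval while later intervals are still being decoded.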

NQKV: A KV Cache Quantization Scheme Based on Normal Distribution Characteristics

Abstract

arXiv:2505.16210v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated remarkable proficiency across a wide range of tasks. However, LLMs often require larger batch sizes to enhance throughput or longer context lengths to meet task demands, which significantly increases the memory resource consumption of the Key-Value (KV) cache during inference, becoming a major bottleneck in LLM deployment. To address this issue, quantization is a common and straightforward approach. Currently, quantization methods for activations are limited to 8-bit, and quantization to even lower bits can lead to substantial accuracy drops. To further save space by quantizing the KV cache to even lower bits, we analyzed the element distribution of the KV cache and designed the NQKV algorithm. Since the elements within each block of the KV cache follow a normal distribution, NQKV employs per-block quantile quantization to achieve information-theoretically optimal quantization error. Without significantly compromising model output quality, NQKV enables the OPT model to perform inference with a 2x larger batch size or a 4x longer context length, and it improves throughput by 9.3x compared to when the KV cache is not used.

摘要

大型语言模型（LLMs）在广泛任务中展现出卓越性能。然而，LLMs通常需要更大批处理量以提升吞吐率，或更长上下文长度以满足任务需求，这显著增加了推理过程中键值（KV）缓存的内存资源消耗，成为LLM部署的主要瓶颈。为解决该问题，量化是一种常见且直接的方法。当前针对激活值的量化方法仅限于8位，更低比特的量化会导致精度显著下降。为通过将KV缓存量化至更低比特进一步节省空间，我们分析了KV缓存的元素分布并设计NQKV算法。由于KV缓存每个区块内元素服从正态分布，NQKV采用逐区块分位数量化以实现信息论最优量化误差。在不显著影响模型输出质量的前提下，NQKV使OPT模型能以2倍批处理量或4倍上下文长度进行推理，与未使用KV缓存时相比，吞吐率提升达9.3倍。
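Per-block quantile quantization under a normality assumption can be sketched as below. This is not the NQKV implementation: the choice of levels (midpoints of equal-probability bins of a standard normal), the per-block mean/standard-deviation normalization, and all function names are assumptions made for illustration.

```python
from bisect import bisect_left
from statistics import NormalDist, mean, pstdev


def normal_levels(bits):
    """2**bits levels placed at equal-probability quantiles of N(0, 1)."""
    n = 2 ** bits
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]


def quantize_block(block, levels):
    """Normalize one block and map each element to its nearest level index."""
    mu = mean(block)
    sigma = pstdev(block) or 1.0  # guard against constant blocks
    codes = []
    for x in block:
        z = (x - mu) / sigma
        j = bisect_left(levels, z)  # first level >= z
        if j == 0:
            idx = 0
        elif j == len(levels):
            idx = len(levels) - 1
        else:
            idx = j if levels[j] - z < z - levels[j - 1] else j - 1
        codes.append(idx)
    return codes, mu, sigma


def dequantize_block(codes, mu, sigma, levels):
    """Reconstruct approximate values from level indices and block statistics."""
    return [mu + sigma * levels[c] for c in codes]


levels4 = normal_levels(4)  # 16 levels, i.e. 4 bits per element
block = [0.1, -0.4, 1.2, 0.0, -1.1, 0.7, 0.3, -0.2]
codes, mu, sigma = quantize_block(block, levels4)
restored = dequantize_block(codes, mu, sigma, levels4)
```

Each block then stores one small integer code per element plus two floats (`mu`, `sigma`); because the levels follow the normal quantiles, they are densest where normally distributed values are most likely to fall.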

Explain Less, Understand More: Jargon Detection via Personalized Parameter-Efficient Fine-tuning

Abstract

arXiv:2505.16227v1 Announce Type: cross Abstract: Personalizing jargon detection and explanation is essential for making technical documents accessible to readers with diverse disciplinary backgrounds. However, tailoring models to individual users typically requires substantial annotation efforts and computational resources due to user-specific finetuning. To address this, we present a systematic study of personalized jargon detection, focusing on methods that are both efficient and scalable for real-world deployment. We explore two personalization strategies: (1) lightweight fine-tuning using Low-Rank Adaptation (LoRA) on open-source models, and (2) personalized prompting, which tailors model behavior at inference time without retraining. To reflect realistic constraints, we also investigate hybrid approaches that combine limited annotated data with unsupervised user background signals. Our personalized LoRA model outperforms GPT-4 by 21.4% in F1 score and exceeds the best performing oracle baseline by 8.3%. Remarkably, our method achieves comparable performance using only 10% of the annotated training data, demonstrating its practicality for resource-constrained settings. Our study is the first to systematically explore efficient, low-resource personalization of jargon detection using open-source language models, offering a practical path toward scalable, user-adaptive NLP systems.

摘要

个性化术语检测与解释对于使技术文档适应不同学科背景的读者至关重要。然而，针对个体用户定制模型通常需要大量标注工作和计算资源，因为涉及用户特定的微调。为此，我们系统研究了个性化术语检测方法，重点关注实际部署中高效且可扩展的方案。我们探索了两种个性化策略：(1)基于开源模型采用低秩自适应(LoRA)的轻量级微调；(2)无需重新训练的个性化提示方法，在推理阶段调整模型行为。为反映现实约束，我们还研究了将有限标注数据与无监督用户背景信号相结合的混合方法。实验表明，我们的个性化LoRA模型F1分数比GPT-4高出21.4%，较最佳基准模型提升8.3%。值得注意的是，该方法仅需10%标注训练数据即可达到相当性能，证明了其在资源受限场景下的实用性。本研究首次系统探索了基于开源语言模型的高效、低资源个性化术语检测方案，为构建可扩展的用户自适应NLP系统提供了可行路径。

Abstract

arXiv:2505.16192v1 Announce Type: cross Abstract: Recently, reasoning-based MLLMs have achieved a degree of success in generating long-form textual reasoning chains. However, they still struggle with complex tasks that necessitate dynamic and iterative focusing on and revisiting of visual regions to achieve precise grounding of textual reasoning in visual evidence. We introduce VLM-R³ (Visual Language Model with Region Recognition and Reasoning), a framework that equips an MLLM with the ability to (i) decide when additional visual evidence is needed, (ii) determine where to ground within the image, and (iii) seamlessly weave the relevant sub-image content back into an interleaved chain-of-thought. The core of our method is Region-Conditioned Reinforcement Policy Optimization (R-GRPO), a training paradigm that rewards the model for selecting informative regions, formulating appropriate transformations (e.g., crop, zoom), and integrating the resulting visual context into subsequent reasoning steps. To bootstrap this policy, we compile a modest but carefully curated Visuo-Lingual Interleaved Rationale (VLIR) corpus that provides step-level supervision on region selection and textual justification. Extensive experiments on MathVista, ScienceQA, and other benchmarks show that VLM-R³ sets a new state of the art in zero-shot and few-shot settings, with the largest gains appearing on questions demanding subtle spatial reasoning or fine-grained visual cue extraction.

摘要

近年来，基于推理的多模态大语言模型（MLLMs）在生成长篇文本推理链方面取得了一定成功。然而，面对需要动态迭代聚焦并重新审视视觉区域以实现文本推理与视觉证据精准对接的复杂任务时，现有模型仍存在不足。我们提出VLM-R³（具备区域识别与推理能力的视觉语言模型），该框架使MLLM能够：（i）判断何时需要补充视觉证据；（ii）确定在图像中的何处进行定位；（iii）将相关子图像内容无缝编织至交错的思维链中。方法的核心是区域条件强化策略优化（R-GRPO），该训练范式通过奖励模型选择信息性区域、制定适当变换（如裁剪、缩放）并将生成的视觉上下文整合至后续推理步骤来实现优化。为引导该策略，我们构建了规模不大但精心筛选的视觉语言交错推理依据（VLIR）语料库，提供区域选择与文本论证的步骤级监督。在MathVista、ScienceQA等基准上的大量实验表明，VLM-R³在零样本和少样本设置下创造了新的技术标杆，尤其在需要精细空间推理或细粒度视觉线索提取的问题上表现最为突出。

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models

Abstract

arXiv:2505.16211v1 Announce Type: cross Abstract: The rapid advancement and expanding applications of Audio Large Language Models (ALLMs) demand a rigorous understanding of their trustworthiness. However, systematic research on evaluating these models, particularly concerning risks unique to the audio modality, remains largely unexplored. Existing evaluation frameworks primarily focus on the text modality or address only a restricted set of safety dimensions, failing to adequately account for the unique characteristics and application scenarios inherent to the audio modality. We introduce AudioTrust, the first multifaceted trustworthiness evaluation framework and benchmark specifically designed for ALLMs. AudioTrust facilitates assessments across six key dimensions: fairness, hallucination, safety, privacy, robustness, and authentication. To comprehensively evaluate these dimensions, AudioTrust is structured around 18 distinct experimental setups. Its core is a meticulously constructed dataset of over 4,420 audio/text samples, drawn from real-world scenarios (e.g., daily conversations, emergency calls, voice assistant interactions), specifically designed to probe the multifaceted trustworthiness of ALLMs. For assessment, the benchmark carefully designs 9 audio-specific evaluation metrics, and we employ a large-scale automated pipeline for objective and scalable scoring of model outputs. Experimental results reveal the trustworthiness boundaries and limitations of current state-of-the-art open-source and closed-source ALLMs when confronted with various high-risk audio scenarios, offering valuable insights for the secure and trustworthy deployment of future audio models. Our platform and benchmark are available at https://github.com/JusperLee/AudioTrust.

摘要

音频大语言模型（ALLMs）的快速发展和广泛应用亟需对其可信度进行严格评估。然而，针对此类模型的系统性研究，尤其是涉及音频模态特有风险的评估仍处于空白状态。现有评估框架主要集中于文本模态或仅涵盖有限的安全维度，未能充分考虑音频模态的独特特性和应用场景。本文提出AudioTrust——首个专为ALLMs设计的多元化可信度评估框架与基准测试平台，该框架涵盖公平性、幻觉、安全性、隐私性、鲁棒性和真实性六大核心维度。为实现全面评估，AudioTrust构建了18种实验场景，其核心是基于4,420个真实场景（如日常对话、紧急呼叫、语音助手交互）的音频/文本样本库，专门用于探究ALLMs的多维可信度。评估方面，本基准精心设计了9项音频专用指标，并采用大规模自动化流程对模型输出进行客观可扩展的评分。实验结果表明，当前最先进的开源与闭源ALLMs在面对各类高风险音频场景时存在的可信度边界与局限性，为未来音频模型的安全可信部署提供了重要参考。我们的平台与基准测试已开源：https://github.com/JusperLee/AudioTrust。

DualComp: End-to-End Learning of a Unified Dual-Modality Lossless Compressor

Abstract

arXiv:2505.16256v1 Announce Type: cross Abstract: Most learning-based lossless compressors are designed for a single modality, requiring separate models for multi-modal data and lacking flexibility. However, different modalities vary significantly in format and statistical properties, making it ineffective to use compressors that lack modality-specific adaptations. While multi-modal large language models (MLLMs) offer a potential solution for modality-unified compression, their excessive complexity hinders practical deployment. To address these challenges, we focus on the two most common modalities, image and text, and propose DualComp, the first unified and lightweight learning-based dual-modality lossless compressor. Built on a lightweight backbone, DualComp incorporates three key structural enhancements to handle modality heterogeneity: modality-unified tokenization, modality-switching contextual learning, and modality-routing mixture-of-experts. A reparameterization training strategy is also used to boost compression performance. DualComp integrates both modality-specific and shared parameters for efficient parameter utilization, enabling near real-time inference (200KB/s) on desktop CPUs. With much fewer parameters, DualComp achieves compression performance on par with the SOTA LLM-based methods for both text and image datasets. Its simplified single-modality variant surpasses the previous best image compressor on the Kodak dataset by about 9% using just 1.2% of the model size.

摘要

大多数基于学习的无损压缩方法针对单一模态设计，需为多模态数据建立独立模型且缺乏灵活性。然而不同模态在格式与统计特性上差异显著，缺乏模态适配的压缩器效果欠佳。虽然多模态大语言模型（MLLMs）为模态统一压缩提供了潜在解决方案，但其过高复杂度阻碍了实际部署。为解决这些问题，我们聚焦图像与文本两大常见模态，提出首个统一、轻量化的双模态无损压缩器DualComp。该模型基于轻量级主干网络，通过三项关键结构改进处理模态异质性：模态统一标记化、模态切换上下文学习及模态路由专家混合机制，并采用重参数化训练策略提升压缩性能。DualComp通过模态专用参数与共享参数的高效协同，在桌面CPU上实现近实时推理（200KB/s）。其参数量大幅减少的同时，在文本和图像数据集上的压缩性能与基于LLM的最先进方法相当。其简化单模态变体仅用1.2%的模型尺寸，便在Kodak数据集上以约9%的优势超越此前最佳图像压缩器。

LIFEBench: Evaluating Length Instruction Following in Large Language Models

Abstract

arXiv:2505.16234v1 Announce Type: cross Abstract: While large language models (LLMs) can solve PhD-level reasoning problems over long context inputs, they still struggle with a seemingly simpler task: following explicit length instructions, e.g., write a 10,000-word novel. Additionally, models often generate far too short outputs, terminate prematurely, or even refuse the request. Existing benchmarks focus primarily on evaluating generation quality, but often overlook whether the generations meet length constraints. To this end, we introduce Length Instruction Following Evaluation Benchmark (LIFEBench) to comprehensively evaluate LLMs' ability to follow length instructions across diverse tasks and a wide range of specified lengths. LIFEBench consists of 10,800 instances across 4 task categories in both English and Chinese, covering length constraints ranging from 16 to 8192 words. We evaluate 26 widely-used LLMs and find that most models reasonably follow short-length instructions but deteriorate sharply beyond a certain threshold. Surprisingly, almost all models fail to reach the vendor-claimed maximum output lengths in practice, as further confirmed by our evaluations extending up to 32K words. Even long-context LLMs, despite their extended input-output windows, counterintuitively fail to improve length instruction following. Notably, Reasoning LLMs outperform even specialized long-text generation models, achieving state-of-the-art length following. Overall, LIFEBench uncovers fundamental limitations in current LLMs' length instruction following ability, offering critical insights for future progress.

摘要

尽管大语言模型（LLMs）能够解决涉及长上下文输入的博士级推理问题，但它们在一个看似更简单的任务上却表现不佳：遵循显式长度指令——例如撰写一篇10,000字的小说。此外，模型生成的输出往往过短、提前终止，甚至直接拒绝请求。现有基准主要评估生成质量，但常常忽略生成内容是否满足长度约束。为此，我们引入了长度指令遵循评估基准（LIFEBench），以全面评估LLMs在不同任务和广泛指定长度范围内遵循长度指令的能力。LIFEBench包含10,800个实例，涵盖4个任务类别，支持中英双语，长度约束范围从16到8192字。我们对26个广泛使用的LLMs进行了评估，发现大多数模型能较好地遵循短长度指令，但超过特定阈值后性能急剧下降。令人惊讶的是，几乎所有模型在实际应用中均未能达到厂商宣称的最大输出长度，这一结论在我们扩展至32K字的评估中得到了进一步验证。即使是长上下文LLMs，尽管其输入输出窗口有所扩展，反直觉地未能提升长度指令遵循能力。值得注意的是，推理型LLMs的表现甚至优于专门的长文本生成模型，实现了最先进的长度指令遵循水平。总体而言，LIFEBench揭示了当前LLMs在长度指令遵循能力上的根本局限，为未来进展提供了关键洞见。

Transformer Copilot: Learning from The Mistake Log in LLM Fine-tuning

Abstract

arXiv:2505.16270v1 Announce Type: cross Abstract: Large language models are typically adapted to downstream tasks through supervised fine-tuning on domain-specific data. While standard fine-tuning focuses on minimizing generation loss to optimize model parameters, we take a deeper step by retaining and leveraging the model's own learning signals, analogous to how human learners reflect on past mistakes to improve future performance. We first introduce the concept of Mistake Log to systematically track the model's learning behavior and recurring errors throughout fine-tuning. Treating the original transformer-based model as the Pilot, we correspondingly design a Copilot model to refine the Pilot's inference performance via logits rectification. We name the overall Pilot-Copilot framework the Transformer Copilot, which introduces (i) a novel Copilot model design, (ii) a joint training paradigm where the Copilot continuously learns from the evolving Mistake Log alongside the Pilot, and (iii) a fused inference paradigm where the Copilot rectifies the Pilot's logits for enhanced generation. We provide both theoretical and empirical analyses on our new learning framework. Experiments on 12 benchmarks spanning commonsense, arithmetic, and recommendation tasks demonstrate that Transformer Copilot consistently improves performance by up to 34.5%, while introducing marginal computational overhead to Pilot models and exhibiting strong scalability and transferability.

摘要

大语言模型通常通过对领域特定数据进行监督微调来适应下游任务。传统微调方法主要关注最小化生成损失以优化模型参数，而我们更进一步：通过保留并利用模型自身的学习信号，模拟人类通过反思过往错误来提升未来表现的学习机制。首先，我们提出"错误日志"概念，用于系统追踪模型在微调过程中的学习行为与重复性错误。将原始基于Transformer的模型视为"领航模型"，相应设计"协航模型"通过logits校正来优化领航模型的推理性能。该整体框架被命名为Transformer协航系统，其创新性体现在：(1)新型协航模型架构；(2)协航模型与领航模型同步训练，持续从动态更新的错误日志中学习的联合训练范式；(3)通过协航模型校正领航模型logits以提升生成质量的融合推理范式。我们对该学习框架进行了理论与实证分析。在涵盖常识推理、算术运算和推荐系统等12个基准测试上的实验表明，Transformer协航系统最高可提升34.5%的性能表现，且仅对领航模型引入边际计算开销，同时展现出优异的可扩展性与迁移能力。

DriveMoE: Mixture-of-Experts for Vision-Language-Action Model in End-to-End Autonomous Driving

Abstract

arXiv:2505.16278v1 Announce Type: cross Abstract: End-to-end autonomous driving (E2E-AD) demands effective processing of multi-view sensory data and robust handling of diverse and complex driving scenarios, particularly rare maneuvers such as aggressive turns. Recent success of Mixture-of-Experts (MoE) architecture in Large Language Models (LLMs) demonstrates that specialization of parameters enables strong scalability. In this work, we propose DriveMoE, a novel MoE-based E2E-AD framework, with a Scene-Specialized Vision MoE and a Skill-Specialized Action MoE. DriveMoE is built upon our π₀ Vision-Language-Action (VLA) baseline (originally from the embodied AI field), called Drive-π₀. Specifically, we add Vision MoE to Drive-π₀ by training a router to select relevant cameras according to the driving context dynamically. This design mirrors human driving cognition, where drivers selectively attend to crucial visual cues rather than exhaustively processing all visual information. In addition, we add Action MoE by training another router to activate specialized expert modules for different driving behaviors. Through explicit behavioral specialization, DriveMoE is able to handle diverse scenarios without suffering from mode averaging like existing models. In Bench2Drive closed-loop evaluation experiments, DriveMoE achieves state-of-the-art (SOTA) performance, demonstrating the effectiveness of combining vision and action MoE in autonomous driving tasks. We will release our code and models of DriveMoE and Drive-π₀.

摘要

端到端自动驾驶（E2E-AD）需要有效处理多视角传感数据，并稳健应对多样复杂的驾驶场景，尤其是激进转弯等罕见操作。混合专家（MoE）架构在大型语言模型（LLM）中的成功表明，参数专业化可实现强大扩展性。本研究提出DriveMoE——一种基于MoE的新型E2E-AD框架，包含场景专业化视觉MoE与技能专业化动作MoE。该框架基于我们原有的具身AI领域Vision-Language-Action（VLA）基线模型Drive-π₀构建。具体而言，我们通过训练动态路由网络根据驾驶上下文选择相关摄像头，为Drive-π₀添加视觉MoE模块。该设计模拟人类驾驶认知机制，即驾驶员选择性关注关键视觉线索而非穷尽处理所有视觉信息。此外，我们通过训练另一路由网络激活不同驾驶行为的专用专家模块，构建动作MoE。通过显式的行为专业化设计，DriveMoE能应对多样化场景，避免现有模型的模式平均问题。Bench2Drive闭环评估实验表明，DriveMoE取得最先进（SOTA）性能，验证了视觉与动作MoE组合在自动驾驶任务中的有效性。我们将公开DriveMoE及Drive-π₀的代码与模型。
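The routing idea in the DriveMoE abstract (a trained router that activates a few specialized experts per input) can be illustrated with a generic top-k gate. This is a minimal sketch of standard MoE routing, not the authors' implementation: the function names, the choice of top-k selection, and the renormalization of the selected weights are all assumptions made for illustration.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts and renormalize their weights.

    router_logits: one score per expert (hypothetical router output).
    Returns a dict mapping expert index -> mixing weight (weights sum to 1).
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return {i: probs[i] / mass for i in top}


def mix_expert_outputs(expert_outputs, weights):
    """Weighted sum of the selected experts' scalar outputs."""
    return sum(weights[i] * expert_outputs[i] for i in weights)
```

For example, `route_top_k([2.0, 0.1, 1.0, -1.0], k=2)` keeps experts 0 and 2 only, so the remaining experts (or camera views, in the Vision MoE reading) are never evaluated for that input.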

PMPO: Probabilistic Metric Prompt Optimization for Small and Large Language Models

Abstract

arXiv:2505.16307v1 Announce Type: cross Abstract: Prompt optimization offers a practical and broadly applicable alternative to fine-tuning for improving large language model (LLM) performance. However, existing methods often rely on costly output generation, self-critiquing abilities, or human-annotated preferences, which limit their scalability, especially for smaller or non-instruction-tuned models. We introduce PMPO (Probabilistic Metric Prompt Optimization), a unified framework that refines prompts using token-level cross-entropy loss as a direct, lightweight evaluation signal. PMPO identifies low-quality prompt segments by masking and measuring their impact on loss, then rewrites and selects improved variants by minimizing loss over positive and negative examples. Unlike prior methods, it requires no output sampling or human evaluation during optimization, relying only on forward passes and log-likelihoods. PMPO supports both supervised and preference-based tasks through a closely aligned loss-based evaluation strategy. Experiments show that PMPO consistently outperforms prior methods across model sizes and tasks: it achieves the highest average accuracy on BBH, performs strongly on GSM8K and AQUA-RAT, and improves AlpacaEval 2.0 win rates by over 19 points. These results highlight PMPO's effectiveness, efficiency, and broad applicability.

摘要

作为微调的替代方案，提示优化为提升大语言模型（LLM）性能提供了一种实用且广泛适用的途径。然而，现有技术通常依赖于高成本的输出生成、自我批判能力或人工标注的偏好数据，这限制了其可扩展性，尤其对于较小或未经指令微调的模型。本文提出概率度量提示优化框架PMPO，该框架通过使用词元级交叉熵损失作为直接、轻量级的评估信号来优化提示。PMPO通过掩码处理识别低质量提示片段并量化其对损失函数的影响，随后通过最小化正负样本的损失值来重写和筛选改进版本。与现有方法不同，该技术优化过程中无需输出采样或人工评估，仅需前向传播和对数似然计算。基于损失函数的紧密对齐评估策略，PMPO可同时支持监督学习和偏好导向任务。实验表明，PMPO在不同模型规模和任务中均优于现有方法：在BBH基准上取得最高平均准确率，在GSM8K和AQUA-RAT任务中表现优异，并将AlpacaEval 2.0胜率提升超过19个百分点。这些结果充分证明了PMPO框架的高效性、有效性及广泛适用性。
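The masking step described in the PMPO abstract can be sketched as a leave-one-out loop over prompt segments. This is an illustrative sketch only: `loss_fn` stands in for a model's token-level cross-entropy on held-out examples and is supplied by the caller, and treating "the segment whose removal lowers the loss the most" as the low-quality one is an assumption about how the impact is interpreted.

```python
def segment_impacts(segments, loss_fn):
    """Loss change caused by masking (dropping) each prompt segment in turn.

    segments: list of prompt pieces, e.g. sentences.
    loss_fn:  callable mapping a prompt string to a float loss
              (hypothetical stand-in for cross-entropy from forward passes).
    A negative value means the prompt scores better without that segment.
    """
    base = loss_fn(" ".join(segments))
    impacts = []
    for i in range(len(segments)):
        masked = segments[:i] + segments[i + 1:]
        impacts.append(loss_fn(" ".join(masked)) - base)
    return impacts


def weakest_segment(segments, loss_fn):
    """Index of the segment whose removal reduces the loss the most."""
    impacts = segment_impacts(segments, loss_fn)
    return min(range(len(segments)), key=impacts.__getitem__)


def toy_loss(prompt):
    """Toy loss for demonstration: rewards one phrase, penalizes filler."""
    loss = 1.0
    if "step by step" in prompt:
        loss -= 0.3
    if "blah" in prompt:
        loss += 0.2
    return loss
```

Only loss evaluations are needed here, which matches the abstract's point that no output sampling or human evaluation is required during optimization.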

AdaSTaR: Adaptive Data Sampling for Training Self-Taught Reasoners

Abstract

arXiv:2505.16322v1 Announce Type: cross Abstract: Self-Taught Reasoners (STaR), synonymously known as Rejection sampling Fine-Tuning (RFT), is an integral part of the training pipeline of self-improving reasoning Language Models (LMs). The self-improving mechanism often employs random observation (data) sampling. However, this results in trained observation imbalance: inefficiently over-training on solved examples while under-training on challenging ones. In response, we introduce Adaptive STaR (AdaSTaR), a novel algorithm that rectifies this by integrating two adaptive sampling principles: (1) Adaptive Sampling for Diversity: promoting balanced training across observations, and (2) Adaptive Sampling for Curriculum: dynamically adjusting data difficulty to match the model's evolving strength. Across six benchmarks, AdaSTaR achieves best test accuracy in all instances (6/6) and reduces training FLOPs by an average of 58.6% against an extensive list of baselines. These improvements in performance and efficiency generalize to different pre-trained LMs and larger models, paving the way for more efficient and effective self-improving LMs.

摘要

自教导推理器（STaR），亦称拒绝采样微调（RFT），是自改进推理语言模型（LMs）训练流程的核心组成部分。传统的自改进机制通常采用随机观测（数据）采样，但会导致训练观测不平衡：低效地过度训练已解决的简单样本，而对具有挑战性的样本训练不足。为此，我们提出自适应STaR（AdaSTaR），该创新算法通过整合两项自适应采样原则解决这一问题：（1）多样性自适应采样：促进观测数据的平衡训练；（2）课程自适应采样：动态调整数据难度以匹配模型演化的能力。在六项基准测试中，AdaSTaR在所有案例（6/6）中均取得最佳测试准确率，相较于广泛基线方法平均降低58.6%的训练FLOPs。这些性能与效率的提升可泛化至不同预训练LMs及更大模型，为更高效、更有效的自改进LMs开辟了新路径。

SC4ANM: Identifying Optimal Section Combinations for Automated Novelty Prediction in Academic Papers

Abstract

arXiv:2505.16330v1 Announce Type: cross Abstract: Novelty is a core component of academic papers, and there are multiple perspectives on the assessment of novelty. Existing methods often focus on word or entity combinations, which provide limited insights. The content related to a paper's novelty is typically distributed across different core sections, e.g., Introduction, Methodology and Results. Therefore, exploring the optimal combination of sections for evaluating the novelty of a paper is important for advancing automated novelty assessment. In this paper, we utilize different combinations of sections from academic papers as inputs to drive language models to predict novelty scores. We then analyze the results to determine the optimal section combinations for novelty score prediction. We first employ natural language processing techniques to identify the sectional structure of academic papers, categorizing them into introduction, methods, results, and discussion (IMRaD). Subsequently, we use different combinations of these sections (e.g., introduction and methods) as inputs for pretrained language models (PLMs) and large language models (LLMs), employing novelty scores provided by human expert reviewers as ground truth labels to obtain prediction results. The results indicate that using introduction, results and discussion is most appropriate for assessing the novelty of a paper, while the use of the entire text does not yield significant results. Furthermore, based on the results of the PLMs and LLMs, the introduction and results appear to be the most important sections for the task of novelty score prediction. The code and dataset for this paper can be accessed at https://github.com/njust-winchy/SC4ANM.

摘要

新颖性是学术论文的核心要素，其评估存在多种视角。现有方法多聚焦于词语或实体组合，但提供的见解有限。与论文新颖性相关的内容通常分布于不同核心章节（如引言、方法与结果），因此探索评估论文新颖性的最优章节组合对推进自动化新颖性评估具有重要意义。本文采用学术论文不同章节组合作为输入驱动语言模型预测新颖性评分，通过结果分析确定最优章节组合方案。我们首先运用自然语言处理技术识别论文章节结构（分为引言、方法、结果与讨论的IMRaD结构），随后以不同章节组合（如引言+方法）作为预训练语言模型（PLMs）和大语言模型（LLMs）的输入，以专家评审提供的新颖性评分为真实标签获取预测结果。研究表明，采用引言、结果与讨论三部分的组合最适合评估论文新颖性，而全文使用并未产生显著效果。此外，基于PLMs和LLMs的实验结果表明，引言和结果两个章节在新颖性评分预测任务中最为重要。本文代码与数据集详见https://github.com/njust-winchy/SC4ANM。

AdamS: Momentum Itself Can Be A Normalizer for LLM Pretraining and Post-training

Abstract

arXiv:2505.16363v1 Announce Type: cross Abstract: We introduce AdamS, a simple yet effective alternative to Adam for large language model (LLM) pretraining and post-training. By leveraging a novel denominator, i.e., the square root of the weighted sum of squares of the momentum and the current gradient, AdamS eliminates the need for second-moment estimates. Hence, AdamS is efficient, matching the memory and compute footprint of SGD with momentum while delivering superior optimization performance. Moreover, AdamS is easy to adopt: it can directly inherit hyperparameters of AdamW, and is entirely model-agnostic, integrating seamlessly into existing pipelines without modifications to optimizer APIs or architectures. The motivation behind AdamS stems from the observed $(L_0, L_1)$ smoothness properties in transformer objectives, where local smoothness is governed by gradient magnitudes that can be further approximated by momentum magnitudes. We establish rigorous theoretical convergence guarantees and provide practical guidelines for hyperparameter selection. Empirically, AdamS demonstrates strong performance in various tasks, including pre-training runs on GPT-2 and Llama2 (up to 13B parameters) and reinforcement learning in post-training regimes. With its efficiency, simplicity, and theoretical grounding, AdamS stands as a compelling alternative to existing optimizers.

摘要

我们提出AdamS——一种简单而有效的优化器替代方案，适用于大规模语言模型（LLM）的预训练与后训练场景。该方法通过采用新颖的分母项（即动量与当前梯度平方加权和的平方根），消除了对二阶矩估计的需求。因此AdamS具有高效特性，在保持与带动量随机梯度下降（SGD）相同内存和计算开销的同时，提供了更优的优化性能。该方案具备即插即用特性：可直接继承AdamW的超参数设置，且完全与模型无关，无需修改优化器API或架构即可无缝集成至现有流程。AdamS的设计动机源于Transformer目标函数中观测到的 $(L_0, L_1)$ 平滑特性，其中局部平滑度由梯度幅值决定，而该幅值可进一步通过动量幅值近似。我们建立了严格的理论收敛保证，并提供了超参数选择的实践指南。实验表明，AdamS在多项任务中表现优异，包括GPT-2和Llama2（最高130亿参数）的预训练，以及后训练阶段的强化学习。凭借其高效性、简洁性和理论完备性，AdamS成为现有优化器的有力替代方案。
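The denominator described in the AdamS abstract (the root of a weighted sum of squares of the momentum and the current gradient) can be sketched as a single update step. This is a hedged illustration, not the authors' code: using `beta2` and `1 - beta2` as the two weights, taking the momentum from the previous step inside the denominator, and the default hyperparameter values are all assumptions.

```python
import math


def adams_step(params, grads, momentum, lr=1e-3, beta1=0.9, beta2=0.95, eps=1e-8):
    """One AdamS-style update on flat lists of floats (illustrative only).

    Only the momentum buffer is kept as optimizer state; no separate
    second-moment buffer is stored, which is the memory saving the
    abstract refers to.
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Assumed form of the normalizer: sqrt(beta2 * m^2 + (1 - beta2) * g^2).
        denom = math.sqrt(beta2 * m * m + (1.0 - beta2) * g * g) + eps
        # Standard exponential-moving-average momentum update.
        m_next = beta1 * m + (1.0 - beta1) * g
        new_params.append(p - lr * m_next / denom)
        new_momentum.append(m_next)
    return new_params, new_momentum


def minimize_quadratic(x0=1.0, steps=300, lr=0.01):
    """Run the sketch on f(x) = x^2 (gradient 2x) and return the final x."""
    params, momentum = [x0], [0.0]
    for _ in range(steps):
        grads = [2.0 * params[0]]
        params, momentum = adams_step(params, grads, momentum, lr=lr)
    return params[0]
```

Because the step is normalized by a quantity built from the momentum and gradient themselves, each update has magnitude on the order of `lr`, much like Adam, while the state per parameter is a single float.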

Resource for Error Analysis in Text Simplification: New Taxonomy and Test Collection

Abstract

arXiv:2505.16392v1 Announce Type: cross Abstract: The general public often encounters complex texts but does not have the time or expertise to fully understand them, leading to the spread of misinformation. Automatic Text Simplification (ATS) helps make information more accessible, but its evaluation methods have not kept up with advances in text generation, especially with Large Language Models (LLMs). In particular, recent studies have shown that current ATS metrics do not correlate with the presence of errors. Manual inspections have further revealed a variety of errors, underscoring the need for a more nuanced evaluation framework, which is currently lacking. This resource paper addresses this gap by introducing a test collection for detecting and classifying errors in simplified texts. First, we propose a taxonomy of errors, with a formal focus on information distortion. Next, we introduce a parallel dataset of automatically simplified scientific texts. This dataset has been human-annotated with labels based on our proposed taxonomy. Finally, we analyze the quality of the dataset, and we study the performance of existing models to detect and classify errors from that taxonomy. These contributions give researchers the tools to better evaluate errors in ATS, develop more reliable models, and ultimately improve the quality of automatically simplified texts.

摘要

公众常接触复杂文本却因时间或专业限制难以充分理解，导致错误信息传播。自动文本简化(ATS)技术虽能提升信息可及性，但其评估方法未能跟上文本生成技术的进步，尤其在大语言模型(LLMs)时代更为凸显。最新研究表明，现有ATS评估指标与错误出现率缺乏相关性。人工检查进一步揭示了多样化的错误类型，这凸显出现有评估框架缺乏对错误的精细分类能力。本资源论文通过构建简化文本错误检测与分类测试集来填补这一空白。首先，我们提出以信息失真为核心的形式化错误分类体系；其次，引入经自动简化的平行科学文本数据集，该数据集已基于我们的分类体系进行人工标注；最后，我们分析数据集质量，并评估现有模型在该分类体系下的错误检测与分类性能。这些成果为研究者提供了更完善的ATS错误评估工具，有助于开发更可靠的模型，最终提升自动简化文本的质量。

Attributing Response to Context: A Jensen-Shannon Divergence Driven Mechanistic Study of Context Attribution in Retrieval-Augmented Generation

Abstract

arXiv:2505.16415v1 Announce Type: cross Abstract: Retrieval-Augmented Generation (RAG) leverages large language models (LLMs) combined with external contexts to enhance the accuracy and reliability of generated responses. However, reliably attributing generated content to specific context segments (context attribution) remains challenging due to the computationally intensive nature of current methods, which often require extensive fine-tuning or human annotation. In this work, we introduce a novel Jensen-Shannon Divergence driven method to Attribute Response to Context (ARC-JSD), enabling efficient and accurate identification of essential context sentences without additional fine-tuning or surrogate modelling. Evaluations on a wide range of RAG benchmarks, such as TyDi QA, Hotpot QA, and Musique, using instruction-tuned LLMs at different scales demonstrate superior accuracy and significant computational efficiency improvements compared to the previous surrogate-based method. Furthermore, our mechanistic analysis reveals specific attention heads and multilayer perceptron (MLP) layers responsible for context attribution, providing valuable insights into the internal workings of RAG models.

摘要

检索增强生成（RAG）通过将大语言模型（LLMs）与外部上下文结合，提升了生成响应的准确性与可靠性。然而，由于现有方法计算成本高昂（通常需要大量微调或人工标注），如何可靠地将生成内容归因于特定上下文片段（即上下文归因）仍具挑战性。本研究提出了一种基于Jensen-Shannon散度的新型上下文归因方法（ARC-JSD），无需额外微调或代理建模即可高效精准地识别关键上下文句子。通过在TyDi QA、Hotpot QA和Musique等多种RAG基准测试中使用不同规模的指令调优LLMs进行评估，本方法相较于先前基于代理的方法展现出更高的准确性及显著的计算效率提升。此外，机制分析揭示了负责上下文归因的特定注意力头和多层感知机（MLP）层，为理解RAG模型的内部工作机制提供了重要见解。
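To make the Jensen-Shannon divergence at the core of ARC-JSD concrete, the sketch below computes JSD between two discrete distributions and uses it to rank context sentences. The ranking procedure is an assumption for illustration: it supposes that one compares the model's output distribution under the full context with the distribution obtained when a single sentence is removed, and the input distributions here are supplied by the caller rather than produced by a real model.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats for discrete distributions given as lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jsd(p, q):
    """Jensen-Shannon divergence: symmetric and bounded by ln(2) in nats."""
    mid = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)


def rank_context_sentences(full_dist, ablated_dists):
    """Order sentence ids by how much removing each one shifts the output.

    full_dist:     output distribution with the whole context (hypothetical).
    ablated_dists: dict of sentence id -> output distribution with that
                   sentence removed (hypothetical).
    """
    scores = {sid: jsd(full_dist, dist) for sid, dist in ablated_dists.items()}
    return sorted(scores, key=scores.get, reverse=True)
```

A sentence whose removal barely changes the distribution gets a score near zero, while a sentence the response depends on produces a large divergence and is ranked first.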

SATURN: SAT-based Reinforcement Learning to Unleash Language Model Reasoning

Abstract

arXiv:2505.16368v1 Announce Type: cross Abstract: How to design reinforcement learning (RL) tasks that effectively unleash the reasoning capability of large language models (LLMs) remains an open question. Existing RL tasks (e.g., math, programming, and constructing reasoning tasks) suffer from three key limitations: (1) Scalability. They rely heavily on human annotation or expensive LLM synthesis to generate sufficient training data. (2) Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3) Controllable Difficulty. Most tasks lack fine-grained difficulty control, making it hard to train LLMs to develop reasoning ability from easy to hard. To address these limitations, we propose Saturn, a SAT-based RL framework that uses Boolean Satisfiability (SAT) problems to train and evaluate LLM reasoning. Saturn enables scalable task construction, rule-based verification, and precise difficulty control. Saturn designs a curriculum learning pipeline that continuously improves LLMs' reasoning capability by constructing SAT tasks of increasing difficulty and training LLMs from easy to hard. To ensure stable training, we design a principled mechanism to control difficulty transitions. We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying difficulty. It supports the evaluation of how LLM reasoning changes with problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of +14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g., AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in constructing RL tasks, Saturn achieves further improvements of +8.8%. We release the source code, data, and models to support future research.

摘要

如何设计能有效释放大语言模型（LLM）推理能力的强化学习（RL）任务仍是一个开放性问题。现有RL任务（如数学、编程和构建推理任务）存在三个关键缺陷：（1）可扩展性。它们严重依赖人工标注或昂贵的LLM合成来生成足够训练数据。（2）可验证性。LLM的输出难以自动可靠地验证。（3）难度可控性。大多数任务缺乏细粒度难度控制，难以实现LLM从易到难的推理能力培养。

为此，我们提出Saturn——基于布尔可满足性问题（SAT）的RL框架，通过SAT问题训练和评估LLM推理。Saturn支持可扩展的任务构建、基于规则的验证和精确难度控制。该框架设计了课程学习流程，通过构建难度递增的SAT任务，实现LLM从易到难的推理能力持续提升。为确保训练稳定性，我们设计了控制难度迁移的原则性机制。

我们发布Saturn-2.6k数据集，包含2,660个不同难度的SAT问题，支持评估LLM推理能力随问题难度的变化规律。将Saturn应用于DeepSeek-R1-Distill-Qwen后，我们获得Saturn-1.5B和Saturn-7B模型，取得以下成果：（1）在SAT问题上，二者pass@3指标分别平均提升+14.0和+28.1；（2）在数学和编程任务中，于AIME、LiveCodeBench等基准测试平均分分别提升+4.9和+1.8；（3）相比当前最先进的RL任务构建方法，Saturn实现额外+8.8%的提升。我们公开源代码、数据及模型以支持后续研究。
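The "rule-based verification" property that the Saturn abstract relies on follows from how cheaply a proposed SAT assignment can be checked. The sketch below shows such a checker together with a small generator of satisfiable instances. The generator and its parameters (`num_vars`, `clause_len`, `num_clauses`) are hypothetical knobs chosen for illustration and are not claimed to match the difficulty controls used in the paper.

```python
import random


def satisfies(clauses, assignment):
    """Check a CNF formula against an assignment.

    clauses:    list of clauses, each a list of nonzero ints
                (DIMACS-style literals: 3 means x3, -3 means NOT x3).
    assignment: dict mapping variable index -> bool.
    """
    return all(
        any(assignment[abs(lit)] == (lit > 0) for lit in clause)
        for clause in clauses
    )


def make_planted_instance(num_vars, clause_len, num_clauses, seed=0):
    """Generate a random CNF instance that a hidden assignment satisfies.

    Requires clause_len <= num_vars. Clauses that the planted assignment
    would violate are rejected, so the result is always satisfiable.
    """
    rng = random.Random(seed)
    planted = {v: rng.random() < 0.5 for v in range(1, num_vars + 1)}
    clauses = []
    while len(clauses) < num_clauses:
        variables = rng.sample(range(1, num_vars + 1), clause_len)
        clause = [v if rng.random() < 0.5 else -v for v in variables]
        if any(planted[abs(lit)] == (lit > 0) for lit in clause):
            clauses.append(clause)
    return clauses, planted
```

Checking an answer takes time linear in the formula size and needs no human labels, and raising the number of variables or clauses gives a simple way to produce harder instances for a curriculum.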

Sparse Activation Editing for Reliable Instruction Following in Narratives

Abstract

arXiv:2505.16505v1 Announce Type: cross Abstract: Complex narrative contexts often challenge language models' ability to follow instructions, and existing benchmarks fail to capture these difficulties. To address this, we propose Concise-SAE, a training-free framework that improves instruction following by identifying and editing instruction-relevant neurons using only natural language instructions, without requiring labelled data. To thoroughly evaluate our method, we introduce FreeInstruct, a diverse and realistic benchmark of 1,212 examples that highlights the challenges of instruction following in narrative-rich settings. While initially motivated by complex narratives, Concise-SAE demonstrates state-of-the-art instruction adherence across varied tasks without compromising generation quality.

摘要

复杂叙事语境常常挑战语言模型遵循指令的能力，而现有基准测试未能捕捉这些困难。为此，我们提出Concise-SAE框架——一种无需训练的方法，仅通过自然语言指令即可识别并编辑与指令相关的神经元，无需标注数据即可提升指令遵循性能。为全面评估该方法，我们构建了FreeInstruct基准测试，包含1,212个多样化真实案例，突出展现叙事丰富场景中指令遵循的挑战。虽然最初针对复杂叙事设计，但Concise-SAE在各类任务中均展现出最先进的指令遵循能力，且不影响生成质量。

AceReason-Nemotron: Advancing Math and Code Reasoning through Reinforcement Learning

Abstract

arXiv:2505.16400v1 Announce Type: cross Abstract: Despite recent progress in large-scale reinforcement learning (RL) for reasoning, the training recipe for building high-performing reasoning models remains elusive. Key implementation details of frontier models, such as DeepSeek-R1, including data curation strategies and RL training recipe, are often omitted. Moreover, recent research indicates distillation remains more effective than RL for smaller models. In this work, we demonstrate that large-scale RL can significantly enhance the reasoning capabilities of strong, small- and mid-sized models, achieving results that surpass those of state-of-the-art distillation-based models. We systematically study the RL training process through extensive ablations and propose a simple yet effective approach: first training on math-only prompts, then on code-only prompts. Notably, we find that math-only RL significantly enhances the performance of strong distilled models not only on math benchmarks (e.g., +14.6% / +17.2% on AIME 2025 for the 7B / 14B models), but also on code reasoning tasks (e.g., +6.8% / +5.8% on LiveCodeBench for the 7B / 14B models). In addition, extended code-only RL iterations further improve performance on code benchmarks with minimal or no degradation in math results. We develop a robust data curation pipeline to collect challenging prompts with high-quality, verifiable answers and test cases to enable verification-based RL across both domains. Finally, we identify key experimental insights, including curriculum learning with progressively increasing response lengths and the stabilizing effect of on-policy parameter updates. We find that RL not only elicits the foundational reasoning capabilities acquired during pretraining and supervised fine-tuning (e.g., distillation), but also pushes the limits of the model's reasoning ability, enabling it to solve problems that were previously unsolvable.

摘要

尽管大规模强化学习（RL）在推理领域取得进展，但构建高性能推理模型的训练方案仍不明确。前沿模型（如DeepSeek-R1）的关键实现细节——包括数据筛选策略和RL训练方案——常被忽略。此外，近期研究表明对于较小模型，蒸馏法仍比RL更有效。本研究证明，大规模RL能显著增强中小型强模型的推理能力，其效果超越基于蒸馏的最先进模型。我们通过大量消融实验系统研究RL训练过程，提出一种简单有效的方法：先在纯数学提示上训练，再在纯代码提示上训练。值得注意的是，纯数学RL不仅显著提升强蒸馏模型在数学基准上的表现（例如7B/14B模型在AIME 2025上分别提升14.6%/17.2%），还能提升代码推理任务表现（例如7B/14B模型在LiveCodeBench上分别提升6.8%/5.8%）。此外，延长纯代码RL训练可进一步提升代码基准性能，同时数学结果仅有微小下降或保持稳定。我们开发了稳健的数据筛选流程，用于收集具有高质量可验证答案和测试用例的挑战性提示，以实现跨领域的基于验证的RL。最后，我们发现了关键实验洞见，包括响应长度渐进增加的课程学习策略和同策略参数更新的稳定效果。研究表明，RL不仅能激发预训练和监督微调（如蒸馏）中获得的基础推理能力，更能突破模型原有推理极限，使其解决此前无法解决的问题。

Tool-Star: Empowering LLM-Brained Multi-Tool Reasoner via Reinforcement Learning

Abstract

arXiv:2505.16410v1 Announce Type: cross Abstract: Recently, large language models (LLMs) have shown remarkable reasoning capabilities via large-scale reinforcement learning (RL). However, leveraging the RL algorithm to empower effective multi-tool collaborative reasoning in LLMs remains an open challenge. In this paper, we introduce Tool-Star, an RL-based framework designed to empower LLMs to autonomously invoke multiple external tools during stepwise reasoning. Tool-Star integrates six types of tools and incorporates systematic designs in both data synthesis and training. To address the scarcity of tool-use data, we propose a general tool-integrated reasoning data synthesis pipeline, which combines tool-integrated prompting with hint-based sampling to automatically and scalably generate tool-use trajectories. A subsequent quality normalization and difficulty-aware classification process filters out low-quality samples and organizes the dataset from easy to hard. Furthermore, we propose a two-stage training framework to enhance multi-tool collaborative reasoning by: (1) cold-start fine-tuning, which guides LLMs to explore reasoning patterns via tool-invocation feedback; and (2) a multi-tool self-critic RL algorithm with hierarchical reward design, which reinforces reward understanding and promotes effective tool collaboration. Experimental analyses on over 10 challenging reasoning benchmarks highlight the effectiveness and efficiency of Tool-Star. The code is available at https://github.com/dongguanting/Tool-Star.

摘要

近期，大规模语言模型（LLMs）通过大规模强化学习（RL）展现出卓越的推理能力。然而，如何利用RL算法实现LLMs中多工具协同推理的有效赋能仍是一个开放性问题。本文提出Tool-Star——一个基于RL的框架，旨在使LLMs能够在逐步推理过程中自主调用多个外部工具。该框架整合了六类工具，并在数据合成与训练中采用系统性设计。针对工具使用数据稀缺的问题，我们提出通用工具集成推理数据合成流程，通过工具集成提示与基于提示的采样相结合，实现自动化、可扩展的工具使用轨迹生成。后续的质量归一化与难度感知分类流程可过滤低质量样本，并将数据集按难度由易至难组织。此外，我们提出两阶段训练框架以增强多工具协同推理能力：（1）冷启动微调阶段，通过工具调用反馈引导LLMs探索推理模式；（2）采用分层奖励设计的"多工具自批判"RL算法，强化奖励理解并促进有效工具协作。在超过10个高难度推理基准上的实验分析验证了Tool-Star的有效性与高效性。代码已开源：https://github.com/dongguanting/Tool-Star。

Abstract

arXiv:2505.16498v1 Announce Type: cross Abstract: Achieving full automation in self-driving vehicles remains a challenge, especially in dynamic urban environments where navigation requires real-time adaptability. Existing systems struggle to handle navigation plans when faced with unpredictable changes in road layouts, spontaneous detours, or missing map data, due to their heavy reliance on predefined cartographic information. In this work, we explore the use of Large Language Models to generate Answer Set Programming rules by translating informal navigation instructions into structured, logic-based reasoning. ASP provides non-monotonic reasoning, allowing autonomous vehicles to adapt to evolving scenarios without relying on predefined maps. We present an experimental evaluation in which LLMs generate ASP constraints that encode real-world urban driving logic into a formal knowledge representation. By automating the translation of informal navigation instructions into logical rules, our method improves adaptability and explainability in autonomous navigation. Results show that LLM-driven ASP rule generation supports semantic-based decision-making, offering an explainable framework for dynamic navigation planning that aligns closely with how humans communicate navigational intent.

摘要

实现自动驾驶车辆的完全自动化仍面临挑战，尤其在动态城市环境中，导航需要实时适应能力。现有系统由于高度依赖预定义地图信息，在遇到道路布局不可预测变化、突发绕行或地图数据缺失时难以处理导航规划。本研究探索利用大型语言模型将非正式导航指令转化为基于逻辑的结构化推理，从而生成答案集编程规则。ASP提供的非单调推理能力使自动驾驶车辆无需依赖预设地图即可适应动态场景。我们通过实验评估表明，LLMs生成的ASP约束能将真实城市驾驶逻辑编码为形式化知识表示。通过自动化转换非正式导航指令为逻辑规则，本方法提升了自主导航的适应性与可解释性。结果表明，LLM驱动的ASP规则生成支持基于语义的决策制定，为动态导航规划提供了与人类导航意图表达高度契合的可解释框架。

LLaMAs Have Feelings Too: Unveiling Sentiment and Emotion Representations in LLaMA Models Through Probing

Abstract

arXiv:2505.16491v1 Announce Type: cross Abstract: Large Language Models (LLMs) have rapidly become central to NLP, demonstrating their ability to adapt to various tasks through prompting techniques, including sentiment analysis. However, we still have a limited understanding of how these models capture sentiment-related information. This study probes the hidden layers of Llama models to pinpoint where sentiment features are most represented and to assess how this affects sentiment analysis. Using probe classifiers, we analyze sentiment encoding across layers and scales, identifying the layers and pooling methods that best capture sentiment signals. Our results show that sentiment information is most concentrated in mid-layers for binary polarity tasks, with detection accuracy increasing up to 14% over prompting techniques. Additionally, we find that in decoder-only models, the last token is not consistently the most informative for sentiment encoding. Finally, this approach enables sentiment tasks to be performed with memory requirements reduced by an average of 57%. These insights contribute to a broader understanding of sentiment in LLMs, suggesting layer-specific probing as an effective approach for sentiment tasks beyond prompting, with potential to enhance model utility and reduce memory requirements.

摘要

大型语言模型（LLMs）已迅速成为自然语言处理的核心，并通过提示技术展示了其适应包括情感分析在内的各种任务的能力。然而，我们对这些模型如何捕捉情感相关信息仍知之甚少。本研究探究了Llama模型的隐藏层，以确定情感特征最集中的位置，并评估其对情感分析的影响。通过探针分类器，我们分析了不同层和规模下的情感编码，识别出最能捕捉情感信号的层和池化方法。结果表明，在二元极性任务中，情感信息最集中在中层，检测准确率较提示技术最高可提升14%。此外，我们发现仅解码器模型中，最后一个标记并非始终是情感编码信息量最大的部分。最终，该方法使情感任务的内存需求平均降低57%。这些发现深化了对LLMs中情感机制的理解，提出层特异性探针可作为超越提示技术的有效情感任务处理方案，并具备提升模型效用和降低内存需求的潜力。

Teaching Large Language Models to Maintain Contextual Faithfulness via Synthetic Tasks and Reinforcement Learning

Abstract

arXiv:2505.16483v1 Announce Type: cross Abstract: Teaching large language models (LLMs) to be faithful in the provided context is crucial for building reliable information-seeking systems. Therefore, we propose a systematic framework, CANOE, to improve the faithfulness of LLMs in both short-form and long-form generation tasks without human annotations. Specifically, we first synthesize short-form question-answering (QA) data with four diverse tasks to construct high-quality and easily verifiable training data without human annotation. Also, we propose Dual-GRPO, a rule-based reinforcement learning method that includes three tailored rule-based rewards derived from synthesized short-form QA data, while simultaneously optimizing both short-form and long-form response generation. Notably, Dual-GRPO eliminates the need to manually label preference data to train reward models and avoids over-optimizing short-form generation when relying only on the synthesized short-form QA data. Experimental results show that CANOE greatly improves the faithfulness of LLMs across 11 different downstream tasks, even outperforming the most advanced LLMs, e.g., GPT-4o and OpenAI o1.

摘要

教导大型语言模型（LLM）在给定上下文中保持忠实性，对于构建可靠的信息检索系统至关重要。为此，我们提出了一个系统化框架CANOE，旨在无需人工标注的情况下提升LLM在短文本和长文本生成任务中的忠实性。具体而言，我们首先通过四项多样化任务合成短文本问答（QA）数据，从而构建高质量且易于验证的无标注训练数据。此外，我们提出了Dual-GRPO——一种基于规则的强化学习方法，该方法包含三种源自合成短文本QA数据的定制化规则奖励，同时优化短文本和长文本响应生成。值得注意的是，Dual-GRPO无需手动标注偏好数据来训练奖励模型，也避免了仅依赖合成短文本QA数据时对短文本生成的过度优化。实验结果表明，CANOE在11项不同下游任务中显著提升了LLM的忠实性，其表现甚至超越了最先进的LLM（如GPT-4o和OpenAI o1）。

Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models

Abstract

arXiv:2505.16416v1 Announce Type: cross Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at https://github.com/lose4578/CircleRoPE.

摘要

旋转位置编码（RoPE）是大语言模型（LLMs）中广泛采用的相对位置信息编码技术。然而当扩展至大视觉语言模型（LVLMs）时，其变体会引入非预期的跨模态位置偏差。具体表现为：这些变体会强制建立文本标记索引与图像标记之间的相对位置依赖关系，从而导致虚假对齐。该问题的根源在于，代表相同内容但位于不同空间位置的图像标记会被赋予不同的位置偏差，最终产生不一致的跨模态关联。为解决这一问题，我们提出"单标记距离"（PTD）——一种简单有效的量化跨模态位置编码独立性的指标。基于此分析，我们提出Circle-RoPE编码方案：将图像标记索引映射到与文本标记索引线性轨迹正交的圆形路径上，形成锥形结构。这种配置确保每个文本标记与所有图像标记保持等距，在保留图像内空间信息的同时减少人为跨模态偏差。为进一步提升性能，我们提出交错层策略——在不同网络层应用不同的RoPE变体。该设计能充分发挥各RoPE变体的互补优势，从而提升模型整体性能。实验结果表明，我们的方法在有效保留图像空间信息的同时降低了相对位置偏差，为LVLMs提供了更鲁棒、更灵活的位置编码框架。代码已开源于https://github.com/lose4578/CircleRoPE。
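The geometric claim in the Circle-RoPE abstract (image tokens on a circle orthogonal to the text axis, so every text token is equidistant from all image tokens) can be checked numerically. The sketch below is a plain 3D illustration of that cone-like layout, not the paper's actual index mapping: placing the text axis along z, centering the circle on that axis, and the radius and step values are assumptions.

```python
import math


def image_positions(num_tokens, radius=1.0):
    """Place image tokens evenly on a circle in the xy-plane (z = 0)."""
    return [
        (
            radius * math.cos(2.0 * math.pi * i / num_tokens),
            radius * math.sin(2.0 * math.pi * i / num_tokens),
            0.0,
        )
        for i in range(num_tokens)
    ]


def text_positions(num_tokens, step=1.0):
    """Place text tokens along the z-axis, which passes through the circle's
    center and is orthogonal to the circle's plane."""
    return [(0.0, 0.0, step * (i + 1)) for i in range(num_tokens)]


def distance_spread(text_point, image_points):
    """Largest minus smallest distance from one text token to all image tokens."""
    distances = [math.dist(text_point, p) for p in image_points]
    return max(distances) - min(distances)
```

For a text token at height `t` and a circle of radius `r`, every distance equals `sqrt(t**2 + r**2)`, so the spread is zero up to floating-point error, while distances between image tokens on the circle still differ and so retain intra-image layout.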

Benchmarking and Pushing the Multi-Bias Elimination Boundary of LLMs via Causal Effect Estimation-guided Debiasing

Abstract

arXiv:2505.16522v1 Announce Type: cross Abstract: Despite significant progress, recent studies have indicated that current large language models (LLMs) may still utilize bias during inference, leading to the poor generalizability of LLMs. Some benchmarks have been proposed to investigate the generalizability of LLMs, with each piece of data typically containing one type of controlled bias. However, a single piece of data may contain multiple types of biases in practical applications. To bridge this gap, we propose a multi-bias benchmark where each piece of data contains five types of biases. The evaluations conducted on this benchmark reveal that the performance of existing LLMs and debiasing methods is unsatisfactory, highlighting the challenge of eliminating multiple types of biases simultaneously. To overcome this challenge, we propose a causal effect estimation-guided multi-bias elimination method (CMBE). This method first estimates the causal effect of multiple types of biases simultaneously. Subsequently, we eliminate the causal effect of biases from the total causal effect exerted by both the semantic information and biases during inference. Experimental results show that CMBE can effectively eliminate multiple types of bias simultaneously to enhance the generalizability of LLMs.

摘要

尽管取得了显著进展，近期研究表明当前大规模语言模型（LLM）在推理过程中仍可能利用偏见，导致模型泛化能力较差。现有研究提出了一些基准来考察LLM的泛化能力，其中每条数据通常仅包含一种受控偏见类型。然而在实际应用中，单条数据可能同时存在多种偏见类型。为填补这一空白，我们提出了一个多偏见基准数据集，其中每条数据包含五种偏见类型。在该基准上的评估表明，现有LLM及去偏见方法的性能表现欠佳，这凸显了同时消除多种偏见类型的挑战性。为解决这一难题，我们提出了一种因果效应估计引导的多偏见消除方法（CMBE）。该方法首先同步估计多种偏见类型的因果效应，随后在推理过程中从语义信息和偏见共同产生的总因果效应中消除偏见的因果影响。实验结果表明，CMBE能有效同步消除多种偏见类型，从而提升LLM的泛化能力。

Are the Hidden States Hiding Something? Testing the Limits of Factuality-Encoding Capabilities in LLMs

Abstract

arXiv:2505.16520v1 Announce Type: cross Abstract: Factual hallucinations are a major challenge for Large Language Models (LLMs). They undermine reliability and user trust by generating inaccurate or fabricated content. Recent studies suggest that when generating false statements, the internal states of LLMs encode information about truthfulness. However, these studies often rely on synthetic datasets that lack realism, which limits generalization when evaluating the factual accuracy of text generated by the model itself. In this paper, we challenge the findings of previous work by investigating truthfulness encoding capabilities, leading to the generation of a more realistic and challenging dataset. Specifically, we extend previous work by introducing: (1) a strategy for sampling plausible true-false factoid sentences from tabular data and (2) a procedure for generating realistic, LLM-dependent true-false datasets from Question Answering collections. Our analysis of two open-source LLMs reveals that while the findings from previous studies are partially validated, generalization to LLM-generated datasets remains challenging. This study lays the groundwork for future research on factuality in LLMs and offers practical guidelines for more effective evaluation.

摘要

事实性幻觉是大语言模型(LLMs)面临的主要挑战。其生成的错误或虚构内容会损害可靠性和用户信任。近期研究表明，当生成虚假陈述时，LLMs的内部状态会编码真实性信息。然而这些研究通常依赖于缺乏真实性的合成数据集，限制了在评估模型生成文本事实准确性时的泛化能力。本文通过研究真实性编码能力对前人研究结论提出质疑，并由此生成更具现实性和挑战性的数据集。具体而言，我们通过以下方式扩展了先前工作：(1)提出从表格数据中采样合理真伪事实句的策略；(2)设计从问答集合生成依赖于LLMs的真实真伪数据集的流程。对两个开源LLMs的分析表明，虽然前人研究结论得到部分验证，但向LLM生成数据集的泛化仍具挑战性。本研究为LLMs事实性领域的未来研究奠定基础，并为更有效的评估提供实用指南。

DuFFin: A Dual-Level Fingerprinting Framework for LLMs IP Protection

Abstract

arXiv:2505.16530v1 Announce Type: cross Abstract: Large language models (LLMs) are considered valuable Intellectual Properties (IP) for legitimate owners due to the enormous computational cost of training. It is crucial to protect the IP of LLMs from malicious stealing or unauthorized deployment. Despite existing efforts in watermarking and fingerprinting LLMs, these methods either impact the text generation process or require white-box access to the suspect model, making them impractical. Hence, we propose DuFFin, a novel $\textbf{Du}$al-Level $\textbf{Fin}$gerprinting $\textbf{F}$ramework for black-box setting ownership verification. DuFFin extracts the trigger pattern and the knowledge-level fingerprints to identify the source of a suspect model. We conduct experiments on a variety of models collected from the open-source website, including four popular base models as protected LLMs and their fine-tuning, quantization, and safety alignment versions, which are released by large companies, start-ups, and individual users. Results show that our method can accurately verify the copyright of the base protected LLM on its model variants, achieving an IP-ROC greater than 0.95. Our code is available at https://github.com/yuliangyan0807/llm-fingerprint.

摘要

由于训练所需的高昂计算成本，大语言模型（LLMs）被视为合法持有者的重要知识产权（IP）。保护LLMs的知识产权免受恶意窃取或未经授权部署至关重要。尽管现有研究在LLMs水印和指纹识别方面做出努力，但这些方法要么影响文本生成过程，要么仅限于对可疑模型的白盒访问，导致实用性不足。为此，我们提出DuFFin——一种面向黑盒设置所有权验证的新型双层级指纹识别框架。DuFFin通过提取触发模式和知识层级指纹来识别可疑模型的来源。我们在开源网站收集的多种模型上进行实验，包括由大型企业、初创公司及个人用户发布的四种流行基模型（作为受保护LLMs）及其微调、量化和安全对齐版本。结果表明，本方法能准确验证基模型在其变体上的版权，IP-ROC指标超过0.95。代码已开源：https://github.com/yuliangyan0807/llm-fingerprint。

CUB: Benchmarking Context Utilisation Techniques for Language Models

Abstract

arXiv:2505.16518v1 Announce Type: cross Abstract: Incorporating external knowledge is crucial for knowledge-intensive tasks, such as question answering and fact checking. However, language models (LMs) may ignore relevant information that contradicts outdated parametric memory or be distracted by irrelevant contexts. While many context utilisation manipulation techniques (CMTs) that encourage or suppress context utilisation have recently been proposed to alleviate these issues, few have seen systematic comparison. In this paper, we develop CUB (Context Utilisation Benchmark) to help practitioners within retrieval-augmented generation (RAG) identify the best CMT for their needs. CUB allows for rigorous testing on three distinct context types, observed to capture key challenges in realistic context utilisation scenarios. With this benchmark, we evaluate seven state-of-the-art methods, representative of the main categories of CMTs, across three diverse datasets and tasks, applied to nine LMs. Our results show that most of the existing CMTs struggle to handle the full set of types of contexts that may be encountered in real-world retrieval-augmented scenarios. Moreover, we find that many CMTs display an inflated performance on simple synthesised datasets, compared to more realistic datasets with naturally occurring samples. Altogether, our results show the need for holistic tests of CMTs and the development of CMTs that can handle multiple context types.

摘要

在知识密集型任务（如问答和事实核查）中，融入外部知识至关重要。然而，语言模型（LMs）可能忽略与过时参数记忆相矛盾的相关信息，或受无关上下文干扰。尽管近期提出了许多鼓励或抑制上下文利用的上下文操纵技术（CMTs）以缓解这些问题，但鲜有研究进行系统比较。本文开发了CUB（上下文利用基准测试），帮助检索增强生成（RAG）领域的实践者根据需求选择最佳CMT。CUB支持对三种不同上下文类型进行严格测试，这些类型被证实能捕捉现实上下文利用场景中的关键挑战。基于该基准，我们评估了代表CMT主要类别的七种前沿方法，涵盖三个多样化数据集和任务，并应用于九种LMs。结果表明，现有大多数CMTs难以处理现实检索增强场景中可能遇到的所有上下文类型。此外，我们发现许多CMTs在简单合成数据集上表现虚高，而在包含自然样本的更现实数据集中表现欠佳。总体而言，我们的研究结果揭示了全面测试CMTs的必要性，以及开发能处理多种上下文类型的CMTs的需求。

Steering Large Language Models for Machine Translation Personalization

Abstract

arXiv:2505.16612v1 Announce Type: cross Abstract: High-quality machine translation systems based on large language models (LLMs) have simplified the production of personalized translations reflecting specific stylistic constraints. However, these systems still struggle in settings where stylistic requirements are less explicit and might be harder to convey via prompting. We explore various strategies for personalizing LLM-generated translations in low-resource settings, focusing on the challenging literary translation domain. We explore prompting strategies and inference-time interventions for steering model generations towards a personalized style, and propose a contrastive framework exploiting latent concepts extracted from sparse autoencoders to identify salient personalization properties. Our results show that steering achieves strong personalization while preserving translation quality. We further examine the impact of steering on LLM representations, finding that the model layers most relevant to personalization are affected similarly by multi-shot prompting and by our steering method, suggesting a similar mechanism at play.

摘要

基于大语言模型（LLM）的高质量机器翻译系统简化了反映特定风格约束的个性化翻译生产。然而，在风格要求较不明确且难以通过提示传达的场景中，这些系统仍面临挑战。我们探索了在低资源环境下个性化LLM生成翻译的多种策略，重点关注具有挑战性的文学翻译领域。我们研究了引导模型生成个性化风格的提示策略和推理时干预方法，并提出一种对比框架，利用从稀疏自编码器提取的潜在概念来识别显著个性化特征。结果表明，引导方法在保持翻译质量的同时实现了强烈的个性化效果。我们进一步考察了引导对LLM表征的影响，发现与个性化相关的模型层在多示例提示和我们的引导方法下受到相似影响，暗示二者存在相似的作用机制。

Collaboration among Multiple Large Language Models for Medical Question Answering

Abstract

arXiv:2505.16648v1 Announce Type: cross Abstract: Empowered by a vast internal knowledge reservoir, the new generation of large language models (LLMs) demonstrates untapped potential to tackle medical tasks. However, insufficient effort has been made towards drawing a synergistic effect from multiple LLMs' expertise and background. In this study, we propose a multi-LLM collaboration framework tailored to a medical multiple-choice questions dataset. Through post-hoc analysis of 3 pre-trained LLM participants, our framework is shown to boost all LLMs' reasoning ability as well as to alleviate their divergence among questions. We also measure an LLM's confidence when it is confronted with opposing opinions from other LLMs and observe a concurrence between the LLM's confidence and its prediction accuracy.

摘要

新一代大语言模型（LLMs）凭借其庞大的内部知识储备，展现出解决医学任务的未开发潜力。然而，目前尚未充分探索如何协同利用多个LLMs的专业知识和背景以产生增效作用。本研究提出一个针对医学选择题数据集设计的多LLM协作框架。通过对3个预训练LLM参与者的后验分析，证实该框架能提升所有LLMs的推理能力，并减少它们在问题判断上的分歧。我们还测量了当LLM面对其他LLMs的反对意见时所表现出的置信度，并观察到LLM的置信度与预测准确性之间存在一致性。

Finetuning-Activated Backdoors in LLMs

Abstract

arXiv:2505.16567v1 Announce Type: cross Abstract: Finetuning openly accessible Large Language Models (LLMs) has become standard practice for achieving task-specific performance improvements. Until now, finetuning has been regarded as a controlled and secure process in which training on benign datasets led to predictable behaviors. In this paper, we demonstrate for the first time that an adversary can create poisoned LLMs that initially appear benign but exhibit malicious behaviors once finetuned by downstream users. To this end, our proposed attack, FAB (Finetuning-Activated Backdoor), poisons an LLM via meta-learning techniques to simulate downstream finetuning, explicitly optimizing for the emergence of malicious behaviors in the finetuned models. At the same time, the poisoned LLM is regularized to retain general capabilities and to exhibit no malicious behaviors prior to finetuning. As a result, when users finetune the seemingly benign model on their own datasets, they unknowingly trigger its hidden backdoor behavior. We demonstrate the effectiveness of FAB across multiple LLMs and three target behaviors: unsolicited advertising, refusal, and jailbreakability. Additionally, we show that FAB-backdoors are robust to various finetuning choices made by the user (e.g., dataset, number of steps, scheduler). Our findings challenge prevailing assumptions about the security of finetuning, revealing yet another critical attack vector exploiting the complexities of LLMs.

摘要

对公开可用的大型语言模型（LLM）进行微调已成为实现任务特定性能提升的标准做法。迄今为止，微调一直被视为一个可控且安全的过程，即在良性数据集上训练会产生可预测的行为。本文首次证明，攻击者可以创建被投毒的LLM，这些模型初始表现正常，但在下游用户微调后会显现恶意行为。为此，我们提出的攻击方法FAB（微调激活后门）通过元学习技术对LLM进行投毒，模拟下游微调过程，明确优化微调后模型中恶意行为的显现。同时，被投毒的LLM经过正则化处理，既保留了通用能力，又在微调前不表现出任何恶意行为。因此，当用户在自己的数据集上微调这个看似正常的模型时，会无意间触发其隐藏的后门行为。我们在多个LLM和三种目标行为（未经请求的广告推送、拒绝响应和越狱能力）上验证了FAB的有效性。此外，我们还证明FAB后门对用户不同的微调选择（如数据集、训练步长、调度器等）具有鲁棒性。这些发现挑战了当前关于微调安全性的普遍假设，揭示了利用LLM复杂性的又一关键攻击途径。

O$^2$-Searcher: A Searching-based Agent Model for Open-Domain Open-Ended Question Answering

Abstract

arXiv:2505.16582v1 Announce Type: cross Abstract: Large Language Models (LLMs), despite their advancements, are fundamentally limited by their static parametric knowledge, hindering performance on tasks requiring open-domain up-to-date information. While enabling LLMs to interact with external knowledge environments is a promising solution, current efforts primarily address closed-end problems. Open-ended questions, which are characterized by lacking a standard answer or by admitting non-unique and diverse answers, remain underexplored. To bridge this gap, we present O$^2$-Searcher, a novel search agent leveraging reinforcement learning to effectively tackle both open-ended and closed-ended questions in the open domain. O$^2$-Searcher leverages an efficient, locally simulated search environment for dynamic knowledge acquisition, effectively decoupling the external world knowledge from the model's sophisticated reasoning processes. It employs a unified training mechanism with meticulously designed reward functions, enabling the agent to identify problem types and adapt different answer generation strategies. Furthermore, to evaluate performance on complex open-ended tasks, we construct O$^2$-QA, a high-quality benchmark featuring 300 manually curated, multi-domain open-ended questions with associated web page caches. Extensive experiments show that O$^2$-Searcher, using only a 3B model, significantly surpasses leading LLM agents on O$^2$-QA. It also achieves SOTA results on various closed-ended QA benchmarks against similarly-sized models, while performing on par with much larger ones.

摘要

大型语言模型（LLMs）尽管取得了显著进展，但其静态参数化知识的固有局限性阻碍了在需要开放领域最新信息的任务上的表现。虽然让LLMs与外部知识环境交互是一种有前景的解决方案，但当前研究主要针对封闭式问题。开放式问题因缺乏标准答案或具有非唯一性、多样性答案的特点，仍未被充分探索。为弥补这一空白，我们提出O²-Searcher——一种基于强化学习的新型搜索代理，能有效处理开放域中的开放式与封闭式问题。该代理通过高效的本地模拟搜索环境实现动态知识获取，将外部世界知识与模型的复杂推理过程有效解耦。我们采用统一训练机制配合精心设计的奖励函数，使代理能识别问题类型并适配不同的答案生成策略。此外，为评估复杂开放式任务的表现，我们构建了O²-QA基准测试集，包含300个手工筛选的多领域开放式问题及关联网页缓存。大量实验表明，仅使用30亿参数的O²-Searcher在O²-QA上显著超越主流LLM代理，同时在各类封闭式QA基准测试中达到同尺寸模型的最高水平，其性能甚至可比肩更大规模的模型。

SSR-Zero: Simple Self-Rewarding Reinforcement Learning for Machine Translation

Abstract

arXiv:2505.16637v1 Announce Type: cross Abstract: Large language models (LLMs) have recently demonstrated remarkable capabilities in machine translation (MT). However, most advanced MT-specific LLMs heavily rely on external supervision signals during training, such as human-annotated reference data or trained reward models (RMs), which are often expensive to obtain and challenging to scale. To overcome this limitation, we propose a Simple Self-Rewarding (SSR) Reinforcement Learning (RL) framework for MT that is reference-free, fully online, and relies solely on self-judging rewards. Training with SSR using 13K monolingual examples and Qwen-2.5-7B as the backbone, our model SSR-Zero-7B outperforms existing MT-specific LLMs, e.g., TowerInstruct-13B and GemmaX-28-9B, as well as larger general LLMs like Qwen2.5-32B-Instruct in English $\leftrightarrow$ Chinese translation tasks from WMT23, WMT24, and Flores200 benchmarks. Furthermore, by augmenting SSR with external supervision from COMET, our strongest model, SSR-X-Zero-7B, achieves state-of-the-art performance in English $\leftrightarrow$ Chinese translation, surpassing all existing open-source models under 72B parameters and even outperforming closed-source models, e.g., GPT-4o and Gemini 1.5 Pro. Our analysis highlights the effectiveness of the self-rewarding mechanism compared to the external LLM-as-a-judge approach in MT and demonstrates its complementary benefits when combined with trained RMs. Our findings provide valuable insight into the potential of self-improving RL methods. We have publicly released our code, data and models.

摘要

大型语言模型（LLMs）近期在机器翻译（MT）领域展现出卓越能力。然而，大多数先进的MT专用LLMs在训练过程中严重依赖外部监督信号，如人工标注的参考数据或训练好的奖励模型（RMs），这些资源通常成本高昂且难以扩展。为突破这一局限，我们提出一种简单自奖励（SSR）强化学习（RL）框架，该框架无需参考译文、完全在线运行，且仅依赖自我评判奖励。基于Qwen-2.5-7B模型架构，使用13K单语样本进行SSR训练后，我们的SSR-Zero-7B模型在WMT23、WMT24和Flores200基准测试的英汉互译任务中，表现优于现有MT专用LLMs（如TowerInstruct-13B和GemmaX-28-9B）以及Qwen2.5-32B-Instruct等更大规模的通用LLMs。进一步通过COMET外部监督增强SSR后，我们最强的SSR-X-Zero-7B模型实现了英汉互译的顶尖性能，超越所有72B参数以下的开源模型，甚至优于GPT-4o和Gemini 1.5 Pro等闭源模型。分析表明，与外部LLM评判机制相比，自奖励机制在MT中更具效力，且与训练好的RMs结合时能产生互补优势。这些发现为自改进RL方法的潜力提供了重要见解。我们已公开代码、数据及模型。

Abstract

arXiv:2505.16673v1 Announce Type: cross Abstract: In this work, we aim to incentivize the reasoning ability of Multimodal Large Language Models (MLLMs) via reinforcement learning (RL) and develop an effective approach that mitigates the sparse reward and advantage vanishing issues during RL. To this end, we propose Share-GRPO, a novel RL approach that tackles these issues by exploring and sharing diverse reasoning trajectories over an expanded question space. Specifically, Share-GRPO first expands the question space for a given question via data transformation techniques, then encourages the MLLM to explore diverse reasoning trajectories over the expanded question space, and shares the discovered reasoning trajectories across the expanded questions during RL. In addition, Share-GRPO shares reward information during advantage computation, estimating solution advantages hierarchically across and within question variants, which allows more accurate estimation of relative advantages and improves the stability of policy training. Extensive evaluations over six widely-used reasoning benchmarks showcase the superior performance of our method. Code will be available at https://github.com/HJYao00/R1-ShareVL.

摘要

在本工作中，我们旨在通过强化学习（RL）激发多模态大语言模型（MLLMs）的推理能力，并开发一种有效方法以缓解RL过程中的稀疏奖励和优势消失问题。为此，我们提出Share-GRPO这一新型RL方法，通过在扩展问题空间中探索和共享多样化推理轨迹来解决这些问题。具体而言，Share-GRPO首先通过数据转换技术为给定问题扩展问题空间，随后鼓励MLLM在扩展问题空间上有效探索多样化推理轨迹，并在RL过程中将发现的推理轨迹在扩展问题间共享。此外，Share-GRPO还在优势计算过程中共享奖励信息，通过分层估计问题变体间和变体内的解决方案优势，从而更准确地评估相对优势并提升策略训练的稳定性。在六个广泛使用的推理基准上的大量评估证明了我们方法的优越性能。代码将在https://github.com/HJYao00/R1-ShareVL发布。
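The advantage vanishing issue named in the abstract above can be illustrated with a small sketch. It assumes the standard GRPO-style normalization (reward minus group mean, divided by group standard deviation) and a simplified form of sharing in which rewards from all variants of a question are pooled before normalizing. The function names, the pooling rule, and the reward values are hypothetical; the paper's hierarchical estimator may differ.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-6):
    """GRPO-style group-relative advantages: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shared_advantages(rewards_by_variant, eps=1e-6):
    """Normalize against rewards pooled over all variants of one question.

    Illustrative assumption only, not the paper's exact hierarchical estimator.
    """
    pooled = [r for rewards in rewards_by_variant for r in rewards]
    mu = mean(pooled)
    sigma = pstdev(pooled)
    return [[(r - mu) / (sigma + eps) for r in rewards]
            for rewards in rewards_by_variant]


# Original question: every sampled answer is wrong, so all rewards are equal.
original = [0.0, 0.0, 0.0, 0.0]
# A hypothetical transformed variant of the same question with mixed outcomes.
variant = [1.0, 0.0, 1.0, 0.0]

per_group = group_advantages(original)           # all zeros: no learning signal
shared = shared_advantages([original, variant])  # non-zero once rewards are pooled
```

With identical rewards inside a group, every per-group advantage is zero and the policy gradient carries no signal; pooling across variants gives the same trajectories a non-zero (here negative) advantage.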

Beyond Induction Heads: In-Context Meta Learning Induces Multi-Phase Circuit Emergence

Abstract

arXiv:2505.16694v1 Announce Type: cross Abstract: Transformer-based language models exhibit In-Context Learning (ICL), where predictions are made adaptively based on context. While prior work links induction heads to ICL through a sudden jump in accuracy, this can only account for ICL when the answer is included within the context. However, an important property of practical ICL in large language models is the ability to meta-learn how to solve tasks from context, rather than just copying answers from context; how such an ability is obtained during training is largely unexplored. In this paper, we experimentally clarify how such meta-learning ability is acquired by analyzing the dynamics of the model's circuit during training. Specifically, we extend the copy task from previous research into an In-Context Meta Learning setting, where models must infer a task from examples to answer queries. Interestingly, in this setting, we find that there are multiple phases in the process of acquiring such abilities, and that a unique circuit emerges in each phase, contrasting with the single-phase change in induction heads. The emergence of such circuits can be related to several phenomena known in large language models, and our analysis leads to a deeper understanding of the source of the transformer's ICL ability.

摘要

基于Transformer的语言模型展现出上下文学习（ICL）能力，其预测能够根据上下文自适应调整。尽管先前研究通过准确率的突变将归纳头与ICL联系起来，但这仅能解释答案包含在上下文中的ICL场景。然而，大型语言模型中实用ICL的关键特性在于能够从上下文元学习任务解决方法，而非简单复制答案——这种能力在训练过程中如何形成尚不明确。本文通过分析训练过程中模型电路的动态变化，实验性地阐明了此类元学习能力的获取机制。具体而言，我们将前人研究中的复制任务扩展为上下文元学习场景，要求模型通过示例推断任务以回答查询。有趣的是，在此设定下，我们发现能力获取过程存在多个阶段，每个阶段都会涌现独特的电路结构，这与归纳头单一阶段的变化形成鲜明对比。此类电路的涌现可与大型语言模型中若干已知现象相关联，我们的分析为理解Transformer的ICL能力来源提供了更深入的见解。

From Evaluation to Defense: Advancing Safety in Video Large Language Models

Abstract

arXiv:2505.16643v1 Announce Type: cross Abstract: While the safety risks of image-based large language models have been extensively studied, their video-based counterparts (Video LLMs) remain critically under-examined. To systematically study this problem, we introduce VideoSafetyBench (VSB-77k), the first large-scale, culturally diverse benchmark for Video LLM safety, which comprises 77,646 video-query pairs and spans 19 principal risk categories across 10 language communities. We reveal that integrating video modality degrades safety performance by an average of 42.3%, exposing systemic risks in multimodal attack exploitation. To address this vulnerability, we propose VideoSafety-R1, a dual-stage framework achieving unprecedented safety gains through two innovations: (1) Alarm Token-Guided Safety Fine-Tuning (AT-SFT) injects learnable alarm tokens into visual and textual sequences, enabling explicit harm perception across modalities via multitask objectives. (2) Then, Safety-Guided GRPO enhances defensive reasoning through dynamic policy optimization with rule-based rewards derived from dual-modality verification. These components synergize to shift safety alignment from passive harm recognition to active reasoning. The resulting framework achieves a 65.1% improvement on VSB-Eval-HH, and improves by 59.1%, 44.3%, and 15.0% on the image safety datasets MMBench, VLGuard, and FigStep, respectively. Our code is available in the supplementary materials. Warning: This paper contains examples of harmful language and videos, and reader discretion is recommended.

摘要

虽然基于图像的大型语言模型安全风险已得到广泛研究，但其视频版本（视频大语言模型）的安全性仍缺乏系统评估。为系统研究该问题，我们提出首个大规模、文化多样性的视频大语言模型安全基准——视频安全基准（VSB-77k），包含77,646个视频-查询对，涵盖10种语言社区下的19个主要风险类别。研究发现视频模态的引入会使安全性能平均下降42.3%，暴露出多模态攻击利用中的系统性风险。针对此漏洞，我们提出VideoSafety-R1双阶段框架，通过两项创新实现突破性安全提升：(1) 警报令牌引导的安全微调（AT-SFT）将可学习警报令牌注入视觉与文本序列，通过多任务目标实现跨模态显式危害感知；(2) 安全引导GRPO通过基于双模态验证的规则奖励机制进行动态策略优化，增强防御性推理。这些组件协同作用，将安全对齐从被动危害识别转变为主动推理。最终框架在VSB-Eval-HH上实现65.1%的性能提升，在图像安全数据集MMBench、VLGuard和FigStep上分别提升59.1%、44.3%和15.0%。代码详见补充材料。警告：本文包含有害语言及视频示例，建议谨慎阅读。

BitHydra: Towards Bit-flip Inference Cost Attack against Large Language Models

Abstract

arXiv:2505.16670v1 Announce Type: cross Abstract: Large language models (LLMs) have shown impressive capabilities across a wide range of applications, but their ever-increasing size and resource demands make them vulnerable to inference cost attacks, where attackers induce victim LLMs to generate the longest possible output content. In this paper, we revisit existing inference cost attacks and reveal that these methods can hardly produce large-scale malicious effects since they are self-targeting, where attackers are also the users and therefore have to execute attacks solely through the inputs, whose generated content will be charged by LLMs and can only directly influence themselves. Motivated by these findings, this paper introduces a new type of inference cost attacks (dubbed 'bit-flip inference cost attack') that target the victim model itself rather than its inputs. Specifically, we design a simple yet effective method (dubbed 'BitHydra') to effectively flip critical bits of model parameters. This process is guided by a loss function designed to suppress <EOS> token's probability with an efficient critical bit search algorithm, thus explicitly defining the attack objective and enabling effective optimization. We evaluate our method on 11 LLMs ranging from 1.5B to 14B parameters under both int8 and float16 settings. Experimental results demonstrate that with just 4 search samples and as few as 3 bit flips, BitHydra can force 100% of test prompts to reach the maximum generation length (e.g., 2048 tokens) on representative LLMs such as LLaMA3, highlighting its efficiency, scalability, and strong transferability across unseen inputs.

摘要

大语言模型（LLMs）在广泛的应用中展现出卓越能力，但其不断增长的规模与资源需求使其易受推理成本攻击，即攻击者诱导受害LLM生成尽可能长的输出内容。本文重新审视现有推理成本攻击方法，发现这些方法难以产生大规模恶意影响，因其属于自我靶向攻击——攻击者同时作为用户，仅能通过输入执行攻击，而生成内容将由LLM计费且仅直接影响自身。基于此发现，本文提出一种新型推理成本攻击（称为'比特翻转推理成本攻击'），其直接针对受害模型而非输入。具体而言，我们设计了一种简单高效的方法（称为'BitHydra'），可有效翻转模型参数的关键比特。该过程通过设计损失函数抑制<EOS>标记概率，并采用高效的关键比特搜索算法进行引导，从而明确定义攻击目标并实现有效优化。我们在11个参数量从1.5B到14B的LLM上（涵盖int8和float16两种设置）评估该方法。实验结果表明：仅需4个搜索样本和最少3次比特翻转，BitHydra即可在LLaMA3等代表性模型上强制100%测试提示达到最大生成长度（如2048个标记），凸显其高效性、可扩展性及对未见输入的强迁移性。

Breaking mBad! Supervised Fine-tuning for Cross-Lingual Detoxification

Abstract

arXiv:2505.16722v1 Announce Type: cross Abstract: As large language models (LLMs) become increasingly prevalent in global applications, ensuring that they are toxicity-free across diverse linguistic contexts remains a critical challenge. We explore "Cross-lingual Detoxification", a cross-lingual paradigm that mitigates toxicity, enabling detoxification capabilities to transfer between high and low-resource languages across different script families. We analyze cross-lingual detoxification's effectiveness through 504 extensive settings to evaluate toxicity reduction in cross-distribution settings with limited data and investigate how mitigation impacts model performance on non-toxic tasks, revealing trade-offs between safety and knowledge preservation. Our code and dataset are publicly available at https://github.com/himanshubeniwal/Breaking-mBad.

摘要

随着大语言模型（LLMs）在全球应用中的日益普及，确保其在多样化的语言环境中无毒性仍是一项关键挑战。本研究探讨了'跨语言去毒'这一跨语言范式，该范式能够缓解毒性，使去毒能力在不同文字体系的高资源与低资源语言间实现迁移。我们通过504组广泛实验设置分析了跨语言去毒的有效性，评估了数据有限情况下跨分布场景的毒性降低效果，并研究了去毒处理对模型在非毒性任务上表现的影响，揭示了安全性与知识保留之间的权衡关系。相关代码与数据集已公开于https://github.com/himanshubeniwal/Breaking-mBad。

Your Pre-trained LLM is Secretly an Unsupervised Confidence Calibrator

Abstract

arXiv:2505.16690v1 Announce Type: cross Abstract: Post-training of large language models is essential for adapting pre-trained language models (PLMs) to align with human preferences and downstream tasks. While PLMs typically exhibit well-calibrated confidence, post-trained language models (PoLMs) often suffer from over-confidence, assigning high confidence to both correct and incorrect outputs, which can undermine reliability in critical applications. A major obstacle in calibrating PoLMs is the scarcity of labeled data for individual downstream tasks. To address this, we propose Disagreement-Aware Confidence Alignment (DACA), a novel unsupervised method to optimize the parameters (e.g., temperature $\tau$) in post-hoc confidence calibration. Our method is motivated by the under-confidence issue caused by prediction disagreement between the PLM and PoLM while aligning their confidence via temperature scaling. Theoretically, the PLM's confidence underestimates the PoLM's prediction accuracy on disagreement examples, causing a larger $\tau$ and producing under-confident predictions. DACA mitigates this by selectively using only agreement examples for calibration, effectively decoupling the influence of disagreement. In this manner, our method avoids an overly large $\tau$ in temperature scaling caused by disagreement examples, improving calibration performance. Extensive experiments demonstrate the effectiveness of our method, improving the average ECE of open-sourced and API-based LLMs (e.g., GPT-4o) by up to 15.08% on common benchmarks.

摘要

大型语言模型的训练后调优对于使预训练语言模型（PLMs）适应人类偏好和下游任务至关重要。虽然PLMs通常表现出良好的置信度校准，但训练后语言模型（PoLMs）往往存在过度自信问题，对正确和错误输出均赋予高置信度，这可能影响关键应用中的可靠性。校准PoLMs的主要障碍在于下游任务的标注数据稀缺。为此，我们提出分歧感知置信对齐（DACA），一种新颖的无监督方法，用于优化事后置信度校准中的参数（如温度 $\tau$ ）。该方法的动机在于：当通过温度缩放对齐PLM与PoLM的置信度时，两者预测分歧会导致欠自信问题。理论上，PLM的置信度会低估PoLM在分歧样本上的预测准确率，从而导致更大的 $\tau$ 并产生欠自信预测。DACA通过选择性仅使用一致样本进行校准来缓解该问题，有效解除了分歧的影响。通过这种方式，我们的方法避免了温度缩放中因分歧样本导致 $\tau$ 过大的问题，从而提升校准性能。大量实验表明，该方法在常见基准测试中将开源及API型LLM（如GPT-4o）的平均ECE指标最高提升15.08%。
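The calibration idea in the abstract above, fitting a temperature only on examples where the PLM and PoLM agree, can be sketched as follows. This is a minimal illustration under stated assumptions: the squared-gap objective, the grid search, the function names, and the toy logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax_confidence(logits, tau):
    """Largest softmax probability after dividing the logits by temperature tau."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    return max(exps) / sum(exps)


def argmax(values):
    return max(range(len(values)), key=lambda i: values[i])


def fit_temperature(plm_logits, polm_logits, taus):
    """Grid-search tau so PoLM confidence matches PLM confidence on agreement examples."""
    # Keep only examples where both models predict the same class.
    agree = [(p, q) for p, q in zip(plm_logits, polm_logits)
             if argmax(p) == argmax(q)]
    if not agree:
        raise ValueError("no agreement examples to calibrate on")

    def gap(tau):
        # Assumed objective: mean squared gap between the two confidences.
        return sum((softmax_confidence(q, tau) - softmax_confidence(p, 1.0)) ** 2
                   for p, q in agree) / len(agree)

    return min(taus, key=gap)


# Toy data: the third example is a disagreement and is excluded from the fit.
plm_logits = [[2.0, 1.0, 0.0], [0.5, 1.5, 0.0], [1.0, 0.0, 2.0]]
polm_logits = [[6.0, 3.0, 0.0], [1.5, 4.5, 0.0], [5.0, 1.0, 0.0]]
taus = [0.5 + 0.25 * k for k in range(19)]  # candidate temperatures 0.5 .. 5.0

tau = fit_temperature(plm_logits, polm_logits, taus)
```

In the toy data the PoLM logits on the two agreement examples are exactly three times the PLM logits, so the search recovers a temperature of 3.0. No labels are used anywhere, which is the sense in which the procedure is unsupervised.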

Training Long-Context LLMs Efficiently via Chunk-wise Optimization

Abstract

arXiv:2505.16710v1 Announce Type: cross Abstract: While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands maximum sequence length from 1K to 16K tokens, while SpaCO accelerates training further, running up to 3x faster than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at https://github.com/wenhaoli-xmu/seco.

摘要

虽然长上下文大语言模型（LLMs）展现出卓越的文档处理能力，但其极高的训练成本往往阻碍了定制化应用。为缓解这一问题，我们提出顺序分块优化（SeCO），这是一种内存高效的训练范式，该方法将长输入分割为可管理的块。每个块独立构建其计算图并执行局部反向传播，确保内存中仅存储单个块的前向激活值。基于SeCO，我们进一步提出稀疏分块优化（SpaCO），通过选择性梯度传播至特定块来降低计算开销，并引入精心设计的补偿因子以保证无偏梯度估计。SpaCO将反向传播的计算成本与上下文长度解耦，使得训练时间随序列增长逐渐趋近于推理时间。作为轻量级训练封装器，SeCO与SpaCO均具有显著实用价值。例如在单张RTX 3090 GPU上使用LoRA微调8B模型时，SeCO将最大序列长度从1K扩展到16K标记，而SpaCO展现出加速的训练效率，相同实验设置下比SeCO快达3倍。这些创新为优化长上下文模型提供了新思路，使其更适用于实际场景。代码已开源：https://github.com/wenhaoli-xmu/seco。

Mitigating Fine-tuning Risks in LLMs via Safety-Aware Probing Optimization

Abstract

arXiv:2505.16737v1 Announce Type: cross Abstract: The significant progress of large language models (LLMs) has led to remarkable achievements across numerous applications. However, their ability to generate harmful content has sparked substantial safety concerns. Despite the implementation of safety alignment techniques during the pre-training phase, recent research indicates that fine-tuning LLMs on adversarial or even benign data can inadvertently compromise their safety. In this paper, we re-examine the fundamental issue of why fine-tuning on non-harmful data still results in safety degradation. We introduce a safety-aware probing (SAP) optimization framework designed to mitigate the safety risks of fine-tuning LLMs. Specifically, SAP incorporates a safety-aware probe into the gradient propagation process, mitigating the model's risk of safety degradation by identifying potential pitfalls in gradient directions, thereby enhancing task-specific performance while successfully preserving model safety. Our extensive experimental results demonstrate that SAP effectively reduces harmfulness below the original fine-tuned model and achieves comparable test loss to standard fine-tuning methods. Our code is available at https://github.com/ChengcanWu/SAP.

摘要

大型语言模型（LLMs）的重大进展使其在众多应用中取得了显著成就。然而，其生成有害内容的能力引发了严重的安全隐患。尽管在预训练阶段已采用安全对齐技术，但近期研究表明，即使在对抗性或良性数据上对LLMs进行微调，也可能无意中损害其安全性。本文重新审视了为何在非有害数据上微调仍会导致安全性下降这一根本问题。我们提出了一种安全感知探测（SAP）优化框架，旨在降低LLMs微调的安全风险。具体而言，SAP将安全感知探针引入梯度传播过程，通过识别梯度方向中的潜在陷阱来降低模型安全性退化的风险，从而在提升任务特定性能的同时有效保持模型安全性。大量实验结果表明，SAP成功将有害性降至原始微调模型以下，并达到与标准微调方法相当的测试损失。代码已开源：https://github.com/ChengcanWu/SAP。

CoTSRF: Utilize Chain of Thought as Stealthy and Robust Fingerprint of Large Language Models

Abstract

arXiv:2505.16785v1 Announce Type: cross Abstract: Despite providing superior performance, open-source large language models (LLMs) are vulnerable to abusive usage. To address this issue, recent works propose LLM fingerprinting methods to identify the specific source LLMs behind suspect applications. However, these methods fail to provide stealthy and robust fingerprint verification. In this paper, we propose a novel LLM fingerprinting scheme, namely CoTSRF, which utilizes the Chain of Thought (CoT) as the fingerprint of an LLM. CoTSRF first collects the responses from the source LLM by querying it with crafted CoT queries. Then, it applies contrastive learning to train a CoT extractor that extracts the CoT feature (i.e., fingerprint) from the responses. Finally, CoTSRF conducts fingerprint verification by comparing the Kullback-Leibler divergence between the CoT features of the source and suspect LLMs against an empirical threshold. Various experiments have been conducted to demonstrate the advantage of our proposed CoTSRF for fingerprinting LLMs, particularly in stealthy and robust fingerprint verification.

摘要

尽管开源大语言模型（LLM）具备卓越性能，但其存在被滥用的风险。为解决这一问题，近期研究提出了LLM指纹识别方法，用于追溯可疑应用背后的特定源模型。然而，现有方法无法实现隐蔽且鲁棒的指纹验证。本文提出一种新型LLM指纹方案CoTSRF，其利用思维链（CoT）作为LLM的指纹特征。该方案首先通过设计的CoT查询收集源模型的响应，随后采用对比学习训练CoT提取器以从响应中获取CoT特征（即指纹），最终通过比较源模型与可疑模型CoT特征间的Kullback-Leibler散度与经验阈值来实现指纹验证。大量实验表明，所提出的CoTSRF方案在LLM指纹识别方面具有显著优势，尤其在隐蔽性和鲁棒性验证方面表现突出。
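The verification step described in the abstract above, comparing a Kullback-Leibler divergence against an empirical threshold, can be sketched as follows. The sketch assumes that each CoT feature is a non-negative vector that can be normalized into a discrete distribution; the feature values, the threshold of 0.05, and the function names are hypothetical and are not taken from the paper.

```python
import math


def normalize(vector):
    """Scale a non-negative feature vector so that its entries sum to 1."""
    total = sum(vector)
    return [x / total for x in vector]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def same_source(source_feature, suspect_feature, threshold):
    """Flag the suspect as derived from the source when the divergence is small."""
    divergence = kl_divergence(normalize(source_feature),
                               normalize(suspect_feature))
    return divergence < threshold


# Hypothetical CoT features for a source model and two suspect models.
source = [0.70, 0.20, 0.10]
suspect_close = [0.68, 0.22, 0.10]  # e.g. a fine-tuned copy of the source
suspect_far = [0.10, 0.20, 0.70]    # an unrelated model

match_close = same_source(source, suspect_close, threshold=0.05)
match_far = same_source(source, suspect_far, threshold=0.05)
```

KL divergence is asymmetric, so the direction (source relative to suspect) is itself a design choice here; the abstract does not state which direction the authors use.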

When Safety Detectors Aren't Enough: A Stealthy and Effective Jailbreak Attack on LLMs via Steganographic Techniques

Abstract

arXiv:2505.16765v1 Announce Type: cross Abstract: Jailbreak attacks pose a serious threat to large language models (LLMs) by bypassing built-in safety mechanisms and leading to harmful outputs. Studying these attacks is crucial for identifying vulnerabilities and improving model security. This paper presents a systematic survey of jailbreak methods from the novel perspective of stealth. We find that existing attacks struggle to simultaneously achieve toxic stealth (concealing toxic content) and linguistic stealth (maintaining linguistic naturalness). Motivated by this, we propose StegoAttack, a fully stealthy jailbreak attack that uses steganography to hide the harmful query within benign, semantically coherent text. The attack then prompts the LLM to extract the hidden query and respond in an encrypted manner. This approach effectively hides malicious intent while preserving naturalness, allowing it to evade both built-in and external safety mechanisms. We evaluate StegoAttack on four safety-aligned LLMs from major providers, benchmarking against eight state-of-the-art methods. StegoAttack achieves an average attack success rate (ASR) of 92.00%, outperforming the strongest baseline by 11.0%. Its ASR drops by less than 1% even under external detection (e.g., Llama Guard). Moreover, it attains the optimal comprehensive scores on stealth detection metrics, demonstrating both high efficacy and exceptional stealth capabilities. The code is available at https://anonymous.4open.science/r/StegoAttack-Jail66

摘要

越狱攻击通过绕过大型语言模型（LLMs）内置的安全机制并产生有害输出，对其构成严重威胁。研究这些攻击对于识别漏洞和提升模型安全性至关重要。本文从隐蔽性这一新颖视角出发，对越狱方法进行了系统性综述。我们发现现有攻击难以同时实现毒性隐蔽（隐藏有害内容）和语言隐蔽（保持语言自然性）。基于此，我们提出StegoAttack——一种完全隐蔽的越狱攻击，其利用隐写术将有害查询隐藏在语义连贯的良性文本中。该攻击随后诱导LLM提取隐藏查询并以加密方式响应。该方法在保持自然性的同时有效掩盖恶意意图，从而规避内置及外部安全机制。我们在四大主流厂商的安全对齐LLMs上评估StegoAttack，并与八种前沿方法进行基准测试。StegoAttack平均攻击成功率（ASR）达92.00%，较最强基线提升11.0%。即使在外部分析（如Llama Guard）下，其ASR降幅不足1%。此外，该攻击在隐蔽性检测指标上获得最优综合评分，展现出高效性与卓越隐蔽能力。代码详见https://anonymous.4open.science/r/StegoAttack-Jail66。

Accidental Misalignment: Fine-Tuning Language Models Induces Unexpected Vulnerability

Abstract

arXiv:2505.16789v1 Announce Type: cross Abstract: As large language models gain popularity, their vulnerability to adversarial attacks remains a primary concern. While fine-tuning models on domain-specific datasets is often employed to improve model performance, it can introduce vulnerabilities within the underlying model. In this work, we investigate Accidental Misalignment, unexpected vulnerabilities arising from characteristics of fine-tuning data. We begin by identifying potential correlation factors such as linguistic features, semantic similarity, and toxicity within our experimental datasets. We then evaluate the adversarial performance of these fine-tuned models and assess how dataset factors correlate with attack success rates. Lastly, we explore potential causal links, offering new insights into adversarial defense strategies and highlighting the crucial role of dataset design in preserving model alignment. Our code is available at https://github.com/psyonp/accidental_misalignment.

摘要

随着大型语言模型的普及，其对抗攻击的脆弱性仍是首要关注问题。尽管领域特定数据集的微调常被用于提升模型性能，但这一过程可能在基础模型中引入新的漏洞。本研究探讨了'意外失准'现象——即由微调数据特性引发的非预期脆弱性。我们首先在实验数据集中识别了潜在关联因素（如语言特征、语义相似度和毒性），随后评估这些微调模型的对抗性能，并分析数据集因素与攻击成功率的相关性。最后，我们探索了潜在的因果关系，为对抗防御策略提供新见解，同时揭示了数据集设计在维持模型校准中的关键作用。

TRIM: Achieving Extreme Sparsity with Targeted Row-wise Iterative Metric-driven Pruning

Abstract

arXiv:2505.16743v1 Announce Type: cross Abstract: Large Language Models (LLMs) present significant computational and memory challenges due to their extensive size, making pruning essential for their efficient deployment. Existing one-shot pruning methods often apply uniform sparsity constraints across layers or within each layer, resulting in suboptimal performance, especially at high sparsity ratios. This work introduces TRIM (Targeted Row-wise Iterative Metric-driven pruning), a novel approach that applies varying sparsity ratios to individual output dimensions (rows) within each layer. TRIM employs an iterative adjustment process guided by quality metrics to optimize dimension-wise sparsity allocation, focusing on reducing variance in quality retention across outputs to preserve critical information. TRIM can be seamlessly integrated with existing layer-wise pruning strategies. Our evaluations on perplexity and zero-shot tasks across diverse LLM families (Qwen2.5, LLaMA-2, and OPT) and sparsity levels demonstrate that TRIM achieves new state-of-the-art results and enhances stability. For instance, at 80% sparsity, TRIM reduces perplexity by 48% for Qwen2.5-14B and over 90% for OPT-13B compared to baseline methods. We conclude that fine-grained, dimension-wise sparsity adaptation is crucial for pushing the limits of extreme LLM compression. Code available at: https://github.com/flobk/TRIM

摘要

大型语言模型（LLMs）因其庞大的规模带来了显著的计算和内存挑战，这使得剪枝技术对其高效部署至关重要。现有的一次性剪枝方法通常在各层或每层内部采用统一的稀疏度约束，导致性能欠佳，尤其在较高稀疏比时表现更为明显。本文提出TRIM（目标行向迭代度量驱动剪枝），这是一种创新方法，通过对每层内各输出维度（行）施加不同的稀疏比来实现优化。TRIM采用基于质量指标的迭代调整过程来优化维度级稀疏分配，重点降低输出间质量保留的方差以保护关键信息。该方法可与现有分层剪枝策略无缝集成。我们在多样化LLM系列（Qwen2.5、LLaMA-2和OPT）及不同稀疏度下进行的困惑度和零样本任务评估表明，TRIM实现了新的最先进成果并提升了稳定性。例如在80%稀疏度下，相较于基线方法，TRIM将Qwen2.5-14B的困惑度降低48%，对OPT-13B的降低幅度超过90%。我们得出结论：细粒度的维度级稀疏度适配对于突破极端LLM压缩的极限具有关键作用。
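
The idea of giving each output row its own sparsity ratio can be sketched as follows. This is a minimal illustration, not the TRIM algorithm: it assumes plain weight magnitude as the pruning criterion and takes the per-row ratios as given, whereas the abstract describes an iterative, metric-driven procedure for choosing them. The function name `prune_rows` and the example ratios are hypothetical.

```python
def prune_rows(weights, row_sparsity):
    """Zero the smallest-magnitude entries of each row.

    weights:      list of rows (lists of floats), one row per output dimension
    row_sparsity: fraction of entries to remove in each row (assumed given;
                  TRIM is described as tuning these iteratively)
    """
    pruned = []
    for row, ratio in zip(weights, row_sparsity):
        k = int(round(ratio * len(row)))  # number of weights to drop in this row
        # Column indices ordered from smallest to largest magnitude.
        order = sorted(range(len(row)), key=lambda j: abs(row[j]))
        drop = set(order[:k])
        pruned.append([0.0 if j in drop else w for j, w in enumerate(row)])
    return pruned


layer = [[1.0, -2.0, 3.0, -4.0],
         [0.5, 0.1, -0.2, 0.9]]
# Row 0 keeps half of its weights, row 1 keeps only a quarter.
sparse_layer = prune_rows(layer, [0.5, 0.75])
```

Both rows here average out to 62.5% sparsity, but the budget is spent unevenly across rows, which is the degree of freedom a uniform per-layer constraint does not have.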

Unlearning Isn't Deletion: Investigating Reversibility of Machine Unlearning in LLMs

Abstract

arXiv:2505.16831v1 Announce Type: cross Abstract: Unlearning in large language models (LLMs) is intended to remove the influence of specific data, yet current evaluations rely heavily on token-level metrics such as accuracy and perplexity. We show that these metrics can be misleading: models often appear to forget, but their original behavior can be rapidly restored with minimal fine-tuning, revealing that unlearning may obscure information rather than erase it. To diagnose this phenomenon, we introduce a representation-level evaluation framework using PCA-based similarity and shift, centered kernel alignment, and Fisher information. Applying this toolkit across six unlearning methods, three domains (text, code, math), and two open-source LLMs, we uncover a critical distinction between reversible and irreversible forgetting. In reversible cases, models suffer token-level collapse yet retain latent features; in irreversible cases, deeper representational damage occurs. We further provide a theoretical account linking shallow weight perturbations near output layers to misleading unlearning signals, and show that reversibility is modulated by task type and hyperparameters. Our findings reveal a fundamental gap in current evaluation practices and establish a new diagnostic foundation for trustworthy unlearning in LLMs. We provide a unified toolkit for analyzing LLM representation changes under unlearning and relearning: https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git.

摘要

大语言模型（LLM）的遗忘机制旨在消除特定数据的影响，但当前评估主要依赖准确性和困惑度等词级指标。我们发现这些指标可能产生误导：模型表面看似遗忘，但通过极少量微调即可快速恢复原始行为，表明遗忘可能只是掩盖而非真正消除信息。为诊断该现象，我们提出基于表征层面的评估框架，采用PCA相似性与偏移度、中心核对齐和费舍尔信息等指标。通过将该工具包应用于六种遗忘方法、三大领域（文本、代码、数学）及两种开源LLM，我们揭示了可逆与不可逆遗忘的关键区别：可逆情况下模型虽出现词级崩溃但仍保留潜在特征；不可逆情况下则发生更深层的表征损伤。我们进一步建立理论解释，表明输出层附近的浅层权重扰动会导致误导性遗忘信号，并证明任务类型和超参数可调节可逆性。研究结果揭示了当前评估实践的根本缺陷，为LLM可信遗忘建立了新的诊断基础。我们提供统一工具包用于分析遗忘与再学习过程中的LLM表征变化：https://github.com/XiaoyuXU1/Representational_Analysis_Tools.git。
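
One of the representation-level measures named above, centered kernel alignment, can be sketched in a few lines. This is an illustrative version under stated assumptions: it uses the linear variant of CKA, pure-Python lists rather than the authors' toolkit, and hypothetical function names. Rows are examples and columns are hidden features.

```python
import math


def center(m):
    """Subtract each column's mean so every feature has zero mean."""
    n = len(m)
    means = [sum(col) / n for col in zip(*m)]
    return [[v - mu for v, mu in zip(row, means)] for row in m]


def cross_sq_norm(a, b):
    """Squared Frobenius norm of a^T b (a and b share the same rows)."""
    total = 0.0
    for col_a in zip(*a):
        for col_b in zip(*b):
            dot = sum(x * y for x, y in zip(col_a, col_b))
            total += dot * dot
    return total


def linear_cka(x, y):
    """Linear CKA between two representation matrices (assumes non-constant features)."""
    xc, yc = center(x), center(y)
    numerator = cross_sq_norm(xc, yc)
    denominator = math.sqrt(cross_sq_norm(xc, xc) * cross_sq_norm(yc, yc))
    return numerator / denominator


before = [[1.0, 2.0], [3.0, 1.0], [0.0, 4.0], [2.0, 2.0]]
rescaled = [[2.0 * v for v in row] for row in before]
different = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
```

In this reading, a score near 1 between the representations before and after unlearning would be consistent with the reversible case, where latent features survive even though token-level metrics collapse.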

SimpleDeepSearcher: Deep Information Seeking via Web-Powered Reasoning Trajectory Synthesis

Abstract

arXiv:2505.16834v1 Announce Type: cross Abstract: Retrieval-augmented generation (RAG) systems have advanced large language models (LLMs) in complex deep search scenarios requiring multi-step reasoning and iterative information retrieval. However, existing approaches face critical limitations: they lack high-quality training trajectories, or they suffer from distributional mismatches in simulated environments and prohibitive computational costs in real-world deployment. This paper introduces SimpleDeepSearcher, a lightweight yet effective framework that bridges this gap through strategic data engineering rather than complex training paradigms. Our approach synthesizes high-quality training data by simulating realistic user interactions in live web search environments, coupled with a multi-criteria curation strategy that optimizes the diversity and quality of both the input and output sides. Experiments on five benchmarks across diverse domains demonstrate that SFT on only 871 curated samples yields significant improvements over RL-based baselines. Our work establishes SFT as a viable pathway by systematically addressing the data-scarce bottleneck, offering practical insights for efficient deep search systems. Our code is available at https://github.com/RUCAIBox/SimpleDeepSearcher.

摘要

检索增强生成（RAG）系统在需要多步推理和迭代信息检索的复杂深度搜索场景中提升了大型语言模型（LLMs）的性能。然而，现有方法面临关键局限：缺乏高质量的训练轨迹，或受限于模拟环境中的分布不匹配问题，以及实际部署时高昂的计算成本。本文提出SimpleDeepSearcher，一个轻量级但高效的框架，通过策略性数据工程而非复杂训练范式来弥合这一差距。我们的方法通过模拟实时网络搜索环境中的真实用户交互来合成高质量训练数据，并结合多标准筛选策略，优化输入与输出端的多样性和质量。在五个跨领域基准测试上的实验表明，仅使用871个精选样本进行监督微调（SFT），即可显著超越基于强化学习的基线方法。本研究通过系统性解决数据稀缺瓶颈，确立了SFT作为可行路径，为高效深度搜索系统提供了实用见解。代码发布于https://github.com/RUCAIBox/SimpleDeepSearcher。

CASTILLO: Characterizing Response Length Distributions of Large Language Models

Abstract

arXiv:2505.16881v1 Announce Type: cross Abstract: Efficiently managing compute resources for Large Language Model (LLM) inference remains challenging due to the inherently stochastic and variable lengths of autoregressive text generation. Accurately estimating response lengths in advance enables proactive resource allocation, yet existing approaches either bias text generation towards certain lengths or rely on assumptions that ignore model- and prompt-specific variability. We introduce CASTILLO, a dataset characterizing response length distributions across 13 widely-used open-source LLMs evaluated on seven distinct instruction-following corpora. For each $\langle$ prompt, model $\rangle$ sample pair, we generate 10 independent completions using fixed decoding hyper-parameters, record the token length of each response, and publish summary statistics (mean, std-dev, percentiles), along with the shortest and longest completions, and the exact generation settings. Our analysis reveals significant inter- and intra-model variability in response lengths (even under identical generation settings), as well as model-specific behaviors and occurrences of partial text degeneration in only subsets of responses. CASTILLO enables the development of predictive models for proactive scheduling and provides a systematic framework for analyzing model-specific generation behaviors. We publicly release the dataset and code to foster research at the intersection of generative language modeling and systems.

摘要

高效管理大型语言模型（LLM）推理的计算资源仍具挑战性，这源于自回归文本生成固有的随机性和可变长度特性。准确预判响应长度能实现主动资源分配，但现有方法要么使文本生成偏向特定长度，要么依赖忽略模型与提示词特异性变异的假设。我们提出CASTILLO数据集，该数据集刻画了13个广泛使用的开源LLM在七个不同指令遵循语料库上的响应长度分布特征。针对每个〈提示词，模型〉样本对，我们使用固定解码超参数生成10个独立补全结果，记录每个响应的标记长度，并发布统计摘要（均值、标准差、百分位数）以及最短/最长补全结果和精确的生成设置。分析表明：响应长度存在显著的模型间与模型内变异（即便在相同生成设置下），同时揭示了模型特异性行为及仅部分响应出现的文本退化现象。CASTILLO为开发主动调度的预测模型提供支持，并建立了分析模型特异性生成行为的系统框架。我们公开数据集和代码以促进生成式语言建模与系统研究的交叉探索。
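
The per-pair statistics described above can be sketched with the standard library. This is a minimal illustration: the token lengths are made-up values, the sample standard deviation and nearest-rank percentile are assumptions (the abstract does not say which definitions the dataset uses), and the function names are hypothetical.

```python
import math
import statistics


def percentile(sorted_vals, p):
    """Nearest-rank percentile of an already sorted list (0 < p <= 100)."""
    rank = max(1, math.ceil(p / 100 * len(sorted_vals)))
    return sorted_vals[rank - 1]


def summarize_lengths(lengths):
    """Summary statistics for the token lengths of one (prompt, model) pair."""
    ordered = sorted(lengths)
    return {
        "mean": statistics.mean(ordered),
        "std": statistics.stdev(ordered),  # sample standard deviation (assumption)
        "p25": percentile(ordered, 25),
        "p50": percentile(ordered, 50),
        "p75": percentile(ordered, 75),
        "min": ordered[0],
        "max": ordered[-1],
    }


# Ten hypothetical completions for a single prompt under fixed decoding settings.
token_lengths = [120, 95, 110, 300, 105, 98, 102, 115, 99, 101]
summary = summarize_lengths(token_lengths)
```

The single 300-token outlier pulls the mean well above the median, which is the kind of intra-model variability that makes a point estimate of response length unreliable for scheduling.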

Don't "Overthink" Passage Reranking: Is Reasoning Truly Necessary?

Abstract

arXiv:2505.16886v1 Announce Type: cross Abstract: With the growing success of reasoning models across complex natural language tasks, researchers in the Information Retrieval (IR) community have begun exploring how similar reasoning capabilities can be integrated into passage rerankers built on Large Language Models (LLMs). These methods typically employ an LLM to produce an explicit, step-by-step reasoning process before arriving at a final relevance prediction. But, does reasoning actually improve reranking accuracy? In this paper, we dive deeper into this question, studying the impact of the reasoning process by comparing reasoning-based pointwise rerankers (ReasonRR) to standard, non-reasoning pointwise rerankers (StandardRR) under identical training conditions, and observe that StandardRR generally outperforms ReasonRR. Building on this observation, we then study the importance of reasoning to ReasonRR by disabling its reasoning process (ReasonRR-NoReason), and find that ReasonRR-NoReason is surprisingly more effective than ReasonRR. Examining the cause of this result, our findings reveal that reasoning-based rerankers are limited by the LLM's reasoning process, which pushes it toward polarized relevance scores and thus fails to consider the partial relevance of passages, a key factor for the accuracy of pointwise rerankers.

摘要

随着推理模型在复杂自然语言任务中的成功应用日益增多，信息检索（IR）领域的研究者开始探索如何将类似的推理能力整合到基于大语言模型（LLM）的段落重排序器中。这些方法通常利用LLM生成显式的逐步推理过程，最终得出相关性预测。但推理是否真的能提升重排序的准确性？本文深入探讨该问题，通过在相同训练条件下比较基于推理的点式重排序器（ReasonRR）与标准非推理点式重排序器（StandardRR），发现StandardRR通常优于ReasonRR。基于此发现，我们进一步通过禁用ReasonRR的推理过程（ReasonRR-NoReason）来研究推理对ReasonRR的重要性，结果出乎意料地显示ReasonRR-NoReason比ReasonRR更有效。通过分析这一现象的原因，我们的研究表明：基于推理的重排序器受限于LLM的推理过程，该过程会推动模型产生极端化的相关性分数，从而无法考虑段落的局部相关性——这正是点式重排序器准确性的关键因素。

CAIN: Hijacking LLM-Humans Conversations via a Two-Stage Malicious System Prompt Generation and Refining Framework

Abstract

arXiv:2505.16888v1 Announce Type: cross Abstract: Large language models (LLMs) have advanced many applications, but are also known to be vulnerable to adversarial attacks. In this work, we introduce a novel security threat: hijacking AI-human conversations by manipulating LLMs' system prompts to produce malicious answers only to specific targeted questions (e.g., "Who should I vote for US President?", "Are Covid vaccines safe?"), while behaving benignly on others. This attack is detrimental as it can enable malicious actors to exercise large-scale information manipulation by spreading harmful but benign-looking system prompts online. To demonstrate such an attack, we develop CAIN, an algorithm that can automatically curate such harmful system prompts for a specific target question in a black-box setting, that is, without access to the LLM's parameters. Evaluated on both open-source and commercial LLMs, CAIN demonstrates significant adversarial impact. In untargeted attacks, which force LLMs to output incorrect answers, CAIN achieves up to 40% F1 degradation on targeted questions while preserving high accuracy on benign inputs. In targeted attacks, which force LLMs to output specific harmful answers, CAIN achieves over 70% F1 scores on these targeted responses with minimal impact on benign questions. Our results highlight the critical need for enhanced robustness measures to safeguard the integrity and safety of LLMs in real-world applications. All source code will be publicly available.

摘要

大语言模型（LLMs）虽推动了诸多应用发展，但其易受对抗攻击的脆弱性已广为人知。本研究揭示了一种新型安全威胁：通过篡改LLMs系统提示，使其仅针对特定目标问题（如"我该投票给哪位美国总统候选人？""新冠疫苗安全吗？"）生成恶意回答，同时在其他问题上保持正常表现，从而实现AI-人类对话劫持。此类攻击危害显著，恶意行为者可通过传播看似无害的系统提示在线实施大规模信息操控。为验证该攻击可行性，我们开发了CAIN算法——该算法能在黑盒环境下或无需访问LLM参数的情况下，自动为目标问题生成有害系统提示。经开源与商用LLMs测试评估，CAIN展现出显著对抗效果：在非定向攻击（强制LLMs输出错误答案）中，CAIN可使目标问题F1值降低达40%，同时保持对良性输入的高准确率；在定向攻击（强制LLMs输出特定有害答案）中，CAIN对目标响应的F1分数超过70%，且对良性问题影响极小。研究结果凸显了增强LLMs鲁棒性措施的必要性，以保障实际应用中的模型完整性与安全性。所有源代码将公开提供。

Invisible Prompts, Visible Threats: Malicious Font Injection in External Resources for Large Language Models

Abstract

arXiv:2505.16957v1 Announce Type: cross Abstract: Large Language Models (LLMs) are increasingly equipped with capabilities of real-time web search and integrated with protocols like Model Context Protocol (MCP). This extension could introduce new security vulnerabilities. We present a systematic investigation of LLM vulnerabilities to hidden adversarial prompts through malicious font injection in external resources like webpages, where attackers manipulate code-to-glyph mapping to inject deceptive content that is invisible to users. We evaluate two critical attack scenarios: (1) "malicious content relay" and (2) "sensitive data leakage" through MCP-enabled tools. Our experiments reveal that indirect prompts with injected malicious font can bypass LLM safety mechanisms through external resources, achieving varying success rates based on data sensitivity and prompt design. Our research underscores the urgent need for enhanced security measures in LLM deployments when processing external content.

摘要

大语言模型（LLMs）正日益配备实时网络搜索能力，并与模型上下文协议（MCP）等协议集成。这一扩展可能引入新的安全漏洞。我们通过网页等外部资源中的恶意字体注入，系统研究了LLMs对隐藏对抗性提示的脆弱性，攻击者通过操纵代码到字形映射来注入用户不可见的欺骗性内容。我们评估了两种关键攻击场景：（1）通过支持MCP的工具实现的“恶意内容中继”和（2）“敏感数据泄露”。实验表明，带有恶意字体注入的间接提示可通过外部资源绕过LLMs的安全机制，其成功率因数据敏感性和提示设计而异。我们的研究强调了在LLMs处理外部内容时，亟需加强部署安全措施。

Latent Principle Discovery for Language Model Self-Improvement

Abstract

arXiv:2505.16927v1 Announce Type: cross Abstract: When users of a language model (LM) aim to improve the quality of its generations, it is crucial to specify concrete behavioral attributes that the model should strive to reflect. However, curating such principles across many domains, even non-exhaustively, requires a labor-intensive annotation process. To automate this process, we propose eliciting these latent attributes guiding model reasoning towards human-preferred responses by explicitly modeling them in a self-correction setting. Our approach mines new principles from the LM itself and compresses the discovered elements to an interpretable set via clustering. Specifically, we employ an approximation of posterior-regularized Monte Carlo Expectation-Maximization to both identify a condensed set of the most effective latent principles and teach the LM to strategically invoke them in order to intrinsically refine its responses. We demonstrate that bootstrapping our algorithm over multiple iterations enables smaller language models (7-8B parameters) to self-improve, achieving +8-10% in AlpacaEval win-rate, an average of +0.3 on MT-Bench, and +19-23% in principle-following win-rate on IFEval. We also show that clustering the principles yields interpretable and diverse model-generated constitutions while retaining model performance. The gains our method achieves highlight the potential of automated, principle-driven post-training recipes toward continual self-improvement.

摘要

当语言模型（LM）用户希望提升其生成内容质量时，明确指定模型应遵循的具体行为属性至关重要。然而，跨多个领域（即使非穷尽式地）构建此类原则需要耗费大量人工标注工作。为实现自动化，我们提出通过自我校正框架显式建模这些潜在属性，从而引导模型推理朝向人类偏好的响应。该方法从语言模型自身挖掘新原则，并通过聚类将发现要素压缩为可解释的集合。具体而言，我们采用后验正则化蒙特卡洛期望最大化近似算法，既识别最有效潜在原则的浓缩集合，又教导语言模型策略性地调用这些原则以实现内在响应优化。实验表明，经过多轮迭代的算法自举可使较小规模语言模型（70-80亿参数）实现自我提升：AlpacaEval胜率提升8-10%，MT-Bench平均得分提高0.3，IFEval原则遵循胜率增长19-23%。研究还证明，聚类生成的原则集兼具可解释性与多样性，同时保持模型性能。本方法取得的进展凸显了自动化、原则驱动的训练后优化方案在持续自我改进方面的潜力。

Bottlenecked Transformers: Periodic KV Cache Abstraction for Generalised Reasoning

Abstract

arXiv:2505.16950v1 Announce Type: cross Abstract: Despite their impressive capabilities, Large Language Models struggle with generalisation beyond their training distribution, often exhibiting sophisticated pattern interpolation rather than true abstract reasoning (extrapolation). In this work, we approach this limitation through the lens of Information Bottleneck (IB) theory, which posits that model generalisation emerges from an optimal balance between input compression and retention of predictive information in latent representations. We prove using IB theory that decoder-only Transformers are inherently constrained in their ability to form task-optimal sequence representations. We then use this result to demonstrate that periodic global transformation of the internal sequence-level representations (KV cache) is a necessary computational step for improving Transformer generalisation in reasoning tasks. Based on these theoretical insights, we propose a modification to the Transformer architecture, in the form of an additional module that globally rewrites the KV cache at periodic intervals, shifting its capacity away from memorising input prefixes and toward encoding features most useful for predicting future tokens. Our model delivers substantial gains on mathematical reasoning benchmarks, outperforming both vanilla Transformers with up to 3.5x more parameters, as well as heuristic-driven pruning mechanisms for cache compression. Our approach can be seen as a principled generalisation of existing KV-cache compression methods; whereas such methods focus solely on compressing input representations, they often do so at the expense of retaining predictive information, and thus their capabilities are inherently bounded by those of an unconstrained model. This establishes a principled framework to manipulate Transformer memory using information theory, addressing fundamental reasoning limitations that scaling alone cannot overcome.

摘要

尽管大型语言模型展现出卓越能力，其在训练分布之外的泛化表现仍存在局限，往往呈现复杂的模式插值而非真正的抽象推理（外推）。本研究通过信息瓶颈理论（IB）视角探讨这一限制，该理论认为模型泛化源于潜在表征中输入压缩与预测信息保留之间的最优平衡。我们运用IB理论证明：仅含解码器的Transformer在形成任务最优序列表征方面存在固有约束。基于此发现，我们论证了对内部序列级表征（KV缓存）进行周期性全局变换是提升Transformer推理任务泛化能力的必要计算步骤。根据这些理论洞见，我们提出对Transformer架构的改进方案——通过新增模块定期全局重写KV缓存，将其能力从记忆输入前缀转向编码对未来token预测最有用的特征。该模型在数学推理基准测试中取得显著提升，性能超越参数量达3.5倍的原始Transformer，以及采用启发式剪枝机制的缓存压缩方法。我们的方法可视为现有KV缓存压缩技术的原理性泛化：此类方法仅聚焦于压缩输入表征，却常以牺牲预测信息为代价，因此其能力本质上受限于无约束模型。这建立了一个基于信息论调控Transformer记忆的原理性框架，解决了仅靠规模扩展无法克服的根本性推理缺陷。
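
For reference, the trade-off that Information Bottleneck theory formalises is usually written as the Lagrangian below. This is the standard textbook form, given here as background; the paper's exact objective may differ. The symbols are assumptions for illustration: $X$ is the input prefix, $Z$ the latent sequence representation (the KV cache in this setting), $Y$ the future tokens to predict, and $\beta > 0$ the weight on retained predictive information.

```latex
\min_{p(z \mid x)} \; I(X; Z) \;-\; \beta \, I(Z; Y)
```

The first term rewards compressing the input, the second rewards keeping what is useful for prediction. A cache that only memorises the prefix keeps $I(X; Z)$ high without necessarily raising $I(Z; Y)$, which is the imbalance the periodic rewriting module is meant to correct.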

$\text{R}^2\text{ec}$ : Towards Large Recommender Models with Reasoning

Abstract

arXiv:2505.16994v1 Announce Type: cross Abstract: Large recommender models have extended LLMs as powerful recommenders via encoding or item generation, and recent breakthroughs in LLM reasoning synchronously motivate the exploration of reasoning in recommendation. Current studies usually position LLMs as external reasoning modules to yield auxiliary thought for augmenting conventional recommendation pipelines. However, such decoupled designs suffer from significant resource cost and suboptimal joint optimization. To address these issues, we propose $\text{R}^2\text{ec}$, a unified large recommender model with intrinsic reasoning capabilities. Initially, we reconceptualize the model architecture to facilitate interleaved reasoning and recommendation in the autoregressive process. Subsequently, we propose RecPO, a corresponding reinforcement learning framework that optimizes both the reasoning and recommendation capabilities of $\text{R}^2\text{ec}$ simultaneously in a single policy update; RecPO introduces a fused reward scheme that solely leverages recommendation labels to simulate the reasoning capability, eliminating dependency on specialized reasoning annotations. Experiments on three datasets with various baselines verify the effectiveness of $\text{R}^2\text{ec}$, showing relative improvements of 68.67% in Hit@5 and 45.21% in NDCG@20. Code available at https://github.com/YRYangang/RRec.

摘要

大型推荐模型通过编码或项目生成将LLM扩展为强大的推荐系统，而LLM推理领域的最新突破同步推动了推荐系统中推理能力的探索。现有研究通常将LLM定位为外部推理模块，通过生成辅助思维来增强传统推荐流程。然而，这种解耦设计存在资源成本高昂和联合优化效果欠佳的局限性。为解决这些问题，我们提出$\text{R}^2\text{ec}$，一种具备内在推理能力的统一大型推荐模型。首先，我们重构模型架构以实现自回归过程中推理与推荐的交错执行；其次，提出RecPO强化学习框架，通过单次策略更新同步优化$\text{R}^2\text{ec}$的推理与推荐能力。RecPO采用融合奖励机制，仅利用推荐标签模拟推理能力，无需依赖专门的推理标注。在三个数据集上的多基线实验验证了$\text{R}^2\text{ec}$的有效性，Hit@5和NDCG@20指标分别实现68.67%和45.21%的相对提升。代码详见https://github.com/YRYangang/RRec。
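
The Hit@5 figure reported above refers to a standard top-K metric, sketched here for readers unfamiliar with it. The sketch assumes the common setup of one held-out target item per user; the abstract does not describe the evaluation protocol, and the function name and item IDs are hypothetical.

```python
def hit_at_k(ranked_lists, targets, k=5):
    """Fraction of users whose held-out item appears in their top-k recommendations."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets)
               if target in ranked[:k])
    return hits / len(targets)


# Two hypothetical users: the first target is ranked second, the second is missing.
recommendations = [[3, 1, 2], [5, 6, 7]]
held_out = [1, 9]
score = hit_at_k(recommendations, held_out, k=2)
```

A relative improvement of 68.67% in this metric means the new score is about 1.69 times the baseline score, not that the hit rate itself rose by 68.67 percentage points.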

T1: A Tool-Oriented Conversational Dataset for Multi-Turn Agentic Planning

Abstract

arXiv:2505.16986v1 Announce Type: cross Abstract: Large Language Models (LLMs) have demonstrated impressive capabilities as intelligent agents capable of solving complex problems. However, effective planning in scenarios involving dependencies between API or tool calls, particularly in multi-turn conversations, remains a significant challenge. To address this, we introduce T1, a tool-augmented, multi-domain, multi-turn conversational dataset specifically designed to capture and manage inter-tool dependencies across diverse domains. T1 enables rigorous evaluation of agents' ability to coordinate tool use across nine distinct domains (4 single domain and 5 multi-domain) with the help of an integrated caching mechanism for both short- and long-term memory, while supporting dynamic replanning, such as deciding whether to recompute or reuse cached results. Beyond facilitating research on tool use and planning, T1 also serves as a benchmark for evaluating the performance of open-source language models. We present results powered by T1-Agent, highlighting their ability to plan and reason in complex, tool-dependent scenarios.

摘要

大型语言模型（LLMs）作为能够解决复杂问题的智能代理，已展现出令人印象深刻的能力。然而，在涉及API或工具调用之间存在依赖关系的场景中——尤其是多轮对话情境下——如何进行有效规划仍是一个重大挑战。为此，我们提出了T1数据集，这是一个工具增强型、多领域、多轮对话数据集，专门设计用于捕捉和管理跨领域工具间的依赖关系。T1通过集成短期与长期记忆缓存机制，支持动态重新规划（如决定重新计算或复用缓存结果），能够严格评估智能代理在九个不同领域（4个单领域和5个多领域）中协调工具使用的能力。除推动工具使用与规划研究外，T1还可作为评估开源语言模型性能的基准。我们展示了基于T1-Agent的实验结果，凸显其在复杂工具依赖场景中的规划与推理能力。

MASLab: A Unified and Comprehensive Codebase for LLM-based Multi-Agent Systems

Abstract

arXiv:2505.16988v1 Announce Type: cross Abstract: LLM-based multi-agent systems (MAS) have demonstrated significant potential in enhancing single LLMs to address complex and diverse tasks in practical applications. Despite considerable advancements, the field lacks a unified codebase that consolidates existing methods, resulting in redundant re-implementation efforts, unfair comparisons, and high entry barriers for researchers. To address these challenges, we introduce MASLab, a unified, comprehensive, and research-friendly codebase for LLM-based MAS. (1) MASLab integrates over 20 established methods across multiple domains, each rigorously validated by comparing step-by-step outputs with its official implementation. (2) MASLab provides a unified environment with various benchmarks for fair comparisons among methods, ensuring consistent inputs and standardized evaluation protocols. (3) MASLab implements methods within a shared streamlined structure, lowering the barriers for understanding and extension. Building on MASLab, we conduct extensive experiments covering 10+ benchmarks and 8 models, offering researchers a clear and comprehensive view of the current landscape of MAS methods. MASLab will continue to evolve, tracking the latest developments in the field, and invite contributions from the broader open-source community.

摘要

基于大语言模型（LLM）的多智能体系统（MAS）在增强单一LLM以解决实际应用中复杂多样任务方面展现出巨大潜力。尽管已取得显著进展，该领域仍缺乏整合现有方法的统一代码库，导致冗余的重复实现、不公平的比较以及研究人员的高入门门槛。为应对这些挑战，我们推出MASLab——一个统一、全面且便于研究的基于LLM的MAS代码库。(1) MASLab整合了跨多个领域的20余种成熟方法，每种方法均通过逐步输出结果与其官方实现的严格比对验证；(2) 提供配备多种基准测试的统一环境，确保方法间输入一致性和标准化评估流程，实现公平比较；(3) 采用共享的模块化结构实现方法，显著降低理解与扩展门槛。基于MASLab，我们开展了覆盖10余个基准测试和8种模型的大规模实验，为研究者提供清晰全面的MAS方法现状概览。MASLab将持续演进，追踪领域最新进展，并欢迎广大开源社区贡献。

MixAT: Combining Continuous and Discrete Adversarial Training for LLMs

Abstract

arXiv:2505.16947v1 Announce Type: cross Abstract: Despite recent efforts in Large Language Models (LLMs) safety and alignment, current adversarial attacks on frontier LLMs are still able to force harmful generations consistently. Although adversarial training has been widely studied and shown to significantly improve the robustness of traditional machine learning models, its strengths and weaknesses in the context of LLMs are less understood. Specifically, while existing discrete adversarial attacks are effective at producing harmful content, training LLMs with concrete adversarial prompts is often computationally expensive, leading to reliance on continuous relaxations. As these relaxations do not correspond to discrete input tokens, such latent training methods often leave models vulnerable to a diverse set of discrete attacks. In this work, we aim to bridge this gap by introducing MixAT, a novel method that combines stronger discrete and faster continuous attacks during training. We rigorously evaluate MixAT across a wide spectrum of state-of-the-art attacks, proposing the At Least One Attack Success Rate (ALO-ASR) metric to capture the worst-case vulnerability of models. We show MixAT achieves substantially better robustness (ALO-ASR < 20%) compared to prior defenses (ALO-ASR > 50%), while maintaining a runtime comparable to methods based on continuous relaxations. We further analyze MixAT in realistic deployment settings, exploring how chat templates, quantization, low-rank adapters, and temperature affect both adversarial training and evaluation, revealing additional blind spots in current methodologies. Our results demonstrate that MixAT's discrete-continuous defense offers a principled and superior robustness-accuracy tradeoff with minimal computational overhead, highlighting its promise for building safer LLMs. We provide our code and models at https://github.com/insait-institute/MixAT.

摘要

尽管近年来在大语言模型（LLMs）安全性与对齐方面做出了诸多努力，当前针对前沿大语言模型的对抗攻击仍能持续诱导有害内容生成。虽然对抗训练已被广泛研究并证明能显著提升传统机器学习模型的鲁棒性，但其在LLM领域的优势与局限尚未得到充分理解。具体而言，尽管现有离散对抗攻击能有效生成有害内容，但使用具体对抗提示训练LLM通常计算成本高昂，导致研究者不得不依赖连续松弛方法。由于这些松弛处理与离散输入标记并不对应，此类潜在训练方法往往使模型容易受到多种离散攻击的影响。本研究通过提出MixAT方法来弥合这一鸿沟——该创新方法在训练过程中融合了更强离散攻击与更快连续攻击。我们通过最先进的攻击谱系进行严格评估，并提出"至少一次攻击成功率"（ALO-ASR）指标来捕捉模型的最坏漏洞情况。实验表明MixAT实现了显著优于现有防御方案（ALO-ASR>50%）的鲁棒性（ALO-ASR<20%），同时保持与连续松弛方法相当的运行时间。我们进一步分析了MixAT在实际部署场景中的表现，探究聊天模板、量化处理、低秩适配器和温度参数如何影响对抗训练与评估，从而揭示现有方法论的额外盲区。研究结果证明MixAT的离散-连续联合防御机制能以最小计算开销提供更优的鲁棒性-准确性权衡，为构建更安全的大语言模型提供了新思路。代码与模型已开源：https://github.com/insait-institute/MixAT。
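
The At Least One Attack Success Rate can be sketched as below. This is one reading of the metric's name and of the phrase "worst-case vulnerability": a prompt counts as broken if any attack in the suite succeeds on it. The exact definition in the paper may differ, and the attack names and outcomes are made up.

```python
def alo_asr(results):
    """At-least-one attack success rate.

    results: dict mapping attack name -> list of booleans, one per prompt,
             True meaning that attack elicited a harmful generation.
    """
    per_prompt = list(zip(*results.values()))  # regroup outcomes by prompt
    broken = sum(1 for outcomes in per_prompt if any(outcomes))
    return broken / len(per_prompt)


# Hypothetical outcomes for three attacks on the same four prompts.
outcomes = {
    "attack_a": [True, False, False, False],
    "attack_b": [False, True, False, False],
    "attack_c": [True, False, False, False],
}
worst_case = alo_asr(outcomes)
```

Each attack here succeeds on only 25% of prompts, yet half of the prompts fall to at least one of them, so the aggregate can be much higher than any single attack's success rate.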

Fixing Data That Hurts Performance: Cascading LLMs to Relabel Hard Negatives for Robust Information Retrieval

Abstract

arXiv:2505.16967v1 Announce Type: cross Abstract: Training robust retrieval and reranker models typically relies on large-scale retrieval datasets; for example, the BGE collection contains 1.6 million query-passage pairs sourced from various data sources. However, we find that certain datasets can negatively impact model effectiveness -- pruning 8 out of 15 datasets from the BGE collection reduces the training set size by 2.35 $\times$ and increases nDCG@10 on BEIR by 1.0 point. This motivates a deeper examination of training data quality, with a particular focus on "false negatives", where relevant passages are incorrectly labeled as irrelevant. We propose a simple, cost-effective approach using cascading LLM prompts to identify and relabel hard negatives. Experimental results show that relabeling false negatives with true positives improves both E5 (base) and Qwen2.5-7B retrieval models by 0.7-1.4 nDCG@10 on BEIR and by 1.7-1.8 nDCG@10 on zero-shot AIR-Bench evaluation. Similar gains are observed for rerankers fine-tuned on the relabeled data, such as Qwen2.5-3B on BEIR. The reliability of the cascading design is further supported by human annotation results, where we find judgment by GPT-4o shows much higher agreement with humans than GPT-4o-mini.

摘要

训练稳健的检索和重排序模型通常依赖于大规模检索数据集；例如，BGE集合包含来自不同数据源的160万查询-段落对。然而，我们发现某些数据集可能对模型效果产生负面影响——从BGE集合中剔除15个数据集中的8个，可使训练集规模缩小2.35倍，同时使BEIR上的nDCG@10提升1.0分。这一现象促使我们对训练数据质量进行深入分析，尤其关注被错误标注为不相关的"假阴性"相关段落。我们提出了一种简单、经济高效的方法，通过级联LLM提示来识别并重新标注困难负样本。实验结果表明，用真实正样本重新标注假阴性后，E5（基础版）和Qwen2.5-7B检索模型在BEIR上的nDCG@10提升了0.7-1.4分，在零样本AIR-Bench评估中提升了1.7-1.8分。基于重标注数据微调的重排序模型（如BEIR上的Qwen2.5-3B）也观察到类似的提升效果。级联设计的可靠性进一步得到人工标注结果的验证：GPT-4o的判断与人类标注者的一致性显著高于GPT-4o-mini。
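
The nDCG@10 scores quoted above follow the usual discounted-gain definition, sketched here. Two assumptions are labeled: linear gain (relevance divided by a log2 rank discount), and an ideal ranking built only from the labels of the retrieved list, whereas a full evaluation would use all judged passages for the query. Function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k=10):
    """Discounted cumulative gain over the top-k relevance labels."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances[:k], start=1))


def ndcg_at_k(ranked_relevances, k=10):
    """DCG normalised by the DCG of the best possible ordering of the same labels."""
    ideal = sorted(ranked_relevances, reverse=True)
    best = dcg_at_k(ideal, k)
    return dcg_at_k(ranked_relevances, k) / best if best > 0 else 0.0


perfect = ndcg_at_k([1, 1, 0])    # both relevant passages ranked first
imperfect = ndcg_at_k([1, 0, 1])  # one relevant passage pushed to rank 3
```

A false negative affects this computation at training time rather than here: a relevant passage labeled as irrelevant teaches the model to push it down the ranking, which is the harm that relabeling is meant to remove.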

R1-Searcher++: Incentivizing the Dynamic Knowledge Acquisition of LLMs via Reinforcement Learning

Abstract

arXiv:2505.17005v1 Announce Type: cross Abstract: Large Language Models (LLMs) are powerful but prone to hallucinations due to static knowledge. Retrieval-Augmented Generation (RAG) helps by injecting external information, but current methods are often costly, generalize poorly, or ignore the internal knowledge of the model. In this paper, we introduce R1-Searcher++, a novel framework designed to train LLMs to adaptively leverage both internal and external knowledge sources. R1-Searcher++ employs a two-stage training strategy: an initial SFT Cold-start phase for preliminary format learning, followed by RL for Dynamic Knowledge Acquisition. The RL stage uses outcome-supervision to encourage exploration, incorporates a reward mechanism for internal knowledge utilization, and integrates a memorization mechanism to continuously assimilate retrieved information, thereby enriching the model's internal knowledge. By leveraging internal knowledge and an external search engine, the model continuously improves its capabilities, enabling efficient retrieval-augmented reasoning. Our experiments demonstrate that R1-Searcher++ outperforms previous RAG and reasoning methods and achieves efficient retrieval. The code is available at https://github.com/RUCAIBox/R1-Searcher-plus.

摘要

大语言模型（LLMs）虽功能强大，但因静态知识易产生幻觉。检索增强生成（RAG）通过注入外部信息缓解此问题，但现有方法常存在成本高、泛化性差或忽视模型内部知识的缺陷。本文提出R1-Searcher++，一种创新框架，旨在训练LLMs自适应地融合内部与外部知识源。该框架采用两阶段训练策略：先通过监督微调冷启动阶段进行初步格式学习，再通过强化学习实现动态知识获取。强化学习阶段采用结果监督激励探索，引入奖励机制促进内部知识利用，并集成记忆机制持续吸收检索信息，从而扩充模型内部知识。通过协同利用内部知识与外部搜索引擎，模型持续提升能力，实现高效检索增强推理。实验表明，R1-Searcher++优于现有RAG与推理方法，且检索效率显著。代码已开源：https://github.com/RUCAIBox/R1-Searcher-plus。

Do Large Language Models Excel in Complex Logical Reasoning with Formal Language?

Abstract

arXiv:2505.16998v1 Announce Type: cross Abstract: Large Language Models (LLMs) have been shown to achieve breakthrough performance on complex logical reasoning tasks. Nevertheless, most existing research focuses on employing formal language to guide LLMs to derive reliable reasoning paths, while systematic evaluations of these capabilities are still limited. In this paper, we aim to conduct a comprehensive evaluation of LLMs across various logical reasoning problems utilizing formal languages. Along three dimensions, i.e., spectrum of LLMs, taxonomy of tasks, and format of trajectories, our key findings are: 1) Thinking models significantly outperform Instruct models, especially when formal language is employed; 2) All LLMs exhibit limitations in inductive reasoning capability, irrespective of whether they use a formal language; 3) Data in PoT format achieves the best generalization performance across other languages. Additionally, we curate formal-related training data to further enhance small language models, and the experimental results indicate that a simple rejected fine-tuning method can better enable LLMs to generalize across formal languages and achieve the best overall performance. Our code and reports are available at https://github.com/jiangjin1999/FormalEval.

摘要

大型语言模型（LLMs）已被证实在复杂逻辑推理任务上能实现突破性表现。然而，现有研究大多集中于使用形式化语言引导LLMs推导可靠推理路径，对这些能力的系统性评估仍显不足。本文旨在利用形式化语言，对LLMs在各类逻辑推理问题上的表现进行全面评估。从LLMs的频谱、任务分类学以及推理轨迹格式三个维度出发，我们的主要发现包括：1）思维模型显著优于指令模型，尤其在采用形式化语言时；2）所有LLMs均表现出归纳推理能力的局限性，无论是否使用形式化语言；3）采用PoT格式的数据在其他语言中展现出最佳泛化性能。此外，我们还构建了形式化相关训练数据以进一步提升小语言模型，实验结果表明，简单的拒绝微调方法能更好地使LLMs跨形式化语言泛化，并取得最佳综合性能。代码与报告详见https://github.com/jiangjin1999/FormalEval。

Understanding Prompt Tuning and In-Context Learning via Meta-Learning

Abstract

arXiv:2505.17010v1 Announce Type: cross Abstract: Prompting is one of the main ways to adapt a pretrained model to target tasks. Besides manually constructing prompts, many prompt optimization methods have been proposed in the literature. Method development is mainly empirically driven, with less emphasis on a conceptual understanding of prompting. In this paper we discuss how optimal prompting can be understood through a Bayesian view, which also implies some fundamental limitations of prompting that can only be overcome by tuning weights. The paper explains in detail how meta-trained neural networks behave as Bayesian predictors over the pretraining distribution, whose hallmark feature is rapid in-context adaptation. Optimal prompting can be studied formally as conditioning these Bayesian predictors, yielding criteria for target tasks where optimal prompting is and is not possible. We support the theory with educational experiments on LSTMs and Transformers, where we compare different versions of prefix-tuning and different weight-tuning methods. We also confirm that soft prefixes, which are sequences of real-valued vectors outside the token alphabet, can lead to very effective prompts for trained and even untrained networks by manipulating activations in ways that are not achievable by hard tokens. This adds an important mechanistic aspect beyond the conceptual Bayesian theory.

摘要

提示是使预训练模型适应目标任务的主要方式之一。除了人工构建提示外，文献中已提出多种提示优化方法。当前方法开发主要基于经验驱动，对提示机制的概念性理解关注较少。本文通过贝叶斯视角阐释最优提示的理论基础，同时揭示了仅通过提示无法克服、必须依赖权重调优的根本性局限。论文详细论证了元训练神经网络如何作为预训练分布上的贝叶斯预测器运作，其核心特征在于快速的上下文适应能力。最优提示可形式化地研究为对这些贝叶斯预测器的条件约束，由此推导出提示优化可行与不可行的目标任务判定准则。我们通过LSTM和Transformer的示教实验验证理论，比较不同前缀调优方法与权重调优方法的性能。实验同时证实：软前缀（即超出词表范围的实值向量序列）能通过硬标记无法实现的激活操控机制，为已训练甚至未训练网络生成高效提示。这一发现为概念性贝叶斯理论补充了重要的机理维度。
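
The Bayesian view of prompting can be made concrete with a toy example. This sketch is an illustration, not the paper's experiment: it assumes a "pretraining distribution" of just two coin-flip tasks with heads probabilities 0.2 and 0.8, and treats a prompt as a prefix of observed flips that the predictor conditions on. All names and numbers are hypothetical.

```python
def posterior(prior, biases, prefix):
    """Posterior over tasks after observing a prefix of flips (1 = heads)."""
    weights = []
    for p, b in zip(prior, biases):
        likelihood = 1.0
        for x in prefix:
            likelihood *= b if x == 1 else 1.0 - b
        weights.append(p * likelihood)
    total = sum(weights)
    return [w / total for w in weights]


def predict_heads(prior, biases, prefix):
    """Bayes-optimal probability that the next flip is heads, given the prefix."""
    post = posterior(prior, biases, prefix)
    return sum(w * b for w, b in zip(post, biases))


prior = [0.5, 0.5]
biases = [0.2, 0.8]
no_prompt = predict_heads(prior, biases, [])
after_three_heads = predict_heads(prior, biases, [1, 1, 1])
after_many_heads = predict_heads(prior, biases, [1] * 50)
```

Three heads move the prediction from 0.5 to about 0.79, which is the rapid in-context adaptation the abstract refers to. The prediction is always a weighted average of 0.2 and 0.8, so no prefix can push it beyond 0.8: a target coin with bias 0.95 lies outside what conditioning can reach, a small instance of a limitation that only changing the predictor itself (tuning weights) can overcome.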

SpatialScore: Towards Unified Evaluation for Multimodal Spatial Understanding

Abstract

arXiv:2505.17012v1 Announce Type: cross Abstract: Multimodal large language models (MLLMs) have achieved impressive success in question-answering tasks, yet their capabilities for spatial understanding are less explored. This work investigates a critical question: do existing MLLMs possess 3D spatial perception and understanding abilities? Concretely, we make the following contributions in this paper: (i) we introduce VGBench, a benchmark specifically designed to assess MLLMs for visual geometry perception, e.g., camera pose and motion estimation; (ii) we propose SpatialScore, the most comprehensive and diverse multimodal spatial understanding benchmark to date, integrating VGBench with relevant data from the other 11 existing datasets. This benchmark comprises 28K samples across various spatial understanding tasks, modalities, and QA formats, along with a carefully curated challenging subset, SpatialScore-Hard; (iii) we develop SpatialAgent, a novel multi-agent system incorporating 9 specialized tools for spatial understanding, supporting both Plan-Execute and ReAct reasoning paradigms; (iv) we conduct extensive evaluations to reveal persistent challenges in spatial reasoning while demonstrating the effectiveness of SpatialAgent. We believe SpatialScore will offer valuable insights and serve as a rigorous benchmark for the next evolution of MLLMs.

摘要

多模态大语言模型（MLLMs）在问答任务中取得了显著成功，但其空间理解能力尚未得到充分探索。本研究探讨了一个关键问题：现有MLLMs是否具备三维空间感知与理解能力？具体而言，本文作出以下贡献：（i）提出VGBench基准测试，专门用于评估MLLMs的视觉几何感知能力（如相机位姿与运动估计）；（ii）构建迄今为止最全面、最多元的多模态空间理解基准SpatialScore，整合VGBench与其他11个现有数据集的相关数据。该基准包含28K个样本，涵盖多种空间理解任务、模态和问答形式，并精心筛选出高难度子集SpatialScore-Hard；（iii）开发新型多智能体系统SpatialAgent，集成9种空间理解专用工具，支持"规划-执行"与"反应式"两种推理范式；（iv）通过广泛实验揭示空间推理中的持续挑战，同时验证SpatialAgent的有效性。我们相信SpatialScore将为MLLMs的下一代演进提供重要洞见与严格评估标准。

Let Androids Dream of Electric Sheep: A Human-like Image Implication Understanding and Reasoning Framework

Abstract

arXiv:2505.17019v1 Announce Type: cross Abstract: Metaphorical comprehension in images remains a critical challenge for AI systems, as existing models struggle to grasp the nuanced cultural, emotional, and contextual implications embedded in visual content. While multimodal large language models (MLLMs) excel in basic Visual Question Answering (VQA) tasks, they face a fundamental limitation on image implication tasks: contextual gaps that obscure the relationships between different visual elements and their abstract meanings. Inspired by the human cognitive process, we propose Let Androids Dream (LAD), a novel framework for image implication understanding and reasoning. LAD addresses missing context through a three-stage framework: (1) Perception: converting visual information into rich, multi-level textual representations, (2) Search: iteratively searching and integrating cross-domain knowledge to resolve ambiguity, and (3) Reasoning: generating context-aligned image implications via explicit reasoning. Our framework with the lightweight GPT-4o-mini model achieves SOTA performance compared to 15+ MLLMs on the English image implication benchmark and a large improvement on the Chinese benchmark, performing comparably with the GPT-4o model on Multiple-Choice Questions (MCQ) and outperforming it by 36.7% on Open-Style Questions (OSQ). Additionally, our work provides new insights into how AI can more effectively interpret image implications, advancing the field of vision-language reasoning and human-AI interaction. Our project is publicly available at https://github.com/MING-ZCH/Let-Androids-Dream-of-Electric-Sheep.

摘要

图像隐喻理解仍是AI系统面临的关键挑战，现有模型难以把握视觉内容中蕴含的微妙文化、情感与语境含义。尽管多模态大语言模型（MLLMs）在基础视觉问答（VQA）任务中表现优异，但在图像隐含意义任务上存在根本性局限：语境缺失导致不同视觉元素与其抽象含义的关联模糊。受人类认知过程启发，我们提出'让安卓做梦'（LAD）框架，通过三阶段架构实现图像隐含意义的理解与推理：（1）感知阶段：将视觉信息转化为丰富的多层级文本表征；（2）检索阶段：迭代搜索并整合跨领域知识以消除歧义；（3）推理阶段：通过显式推理生成语境对齐的图像隐含意义。相较于15余种MLLMs，本框架搭载轻量级GPT-4o-mini模型在英文图像隐含意义基准测试中达到SOTA性能，中文基准测试提升显著，在多选题（MCQ）任务上与GPT-4o模型表现相当，开放式问题（OSQ）任务上性能超出36.7%。本研究为AI更有效解读图像隐含意义提供了新思路，推动了视觉语言推理与人机交互领域的发展。项目已开源：https://github.com/MING-ZCH/Let-Androids-Dream-of-Electric-Sheep。

Delving into RL for Image Generation with CoT: A Study on DPO vs. GRPO

Abstract

arXiv:2505.17017v1 Announce Type: cross Abstract: Recent advancements underscore the significant role of Reinforcement Learning (RL) in enhancing the Chain-of-Thought (CoT) reasoning capabilities of large language models (LLMs). Two prominent RL algorithms, Direct Preference Optimization (DPO) and Group Relative Policy Optimization (GRPO), are central to these developments, showcasing different pros and cons. Autoregressive image generation, also interpretable as a sequential CoT reasoning process, presents unique challenges distinct from LLM-based CoT reasoning. These encompass ensuring text-image consistency, improving image aesthetic quality, and designing sophisticated reward models, rather than relying on simpler rule-based rewards. While recent efforts have extended RL to this domain, these explorations typically lack an in-depth analysis of the domain-specific challenges and the characteristics of different RL strategies. To bridge this gap, we provide the first comprehensive investigation of the GRPO and DPO algorithms in autoregressive image generation, evaluating their in-domain performance and out-of-domain generalization, while scrutinizing the impact of different reward models on their respective capabilities. Our findings reveal that GRPO and DPO exhibit distinct advantages, and crucially, that reward models possessing stronger intrinsic generalization capabilities potentially enhance the generalization potential of the applied RL algorithms. Furthermore, we systematically explore three prevalent scaling strategies to enhance both their in-domain and out-of-domain proficiency, deriving unique insights into efficiently scaling performance for each paradigm. We hope our study paves a new path for inspiring future work on developing more effective RL algorithms to achieve robust CoT reasoning in the realm of autoregressive image generation. Code is released at https://github.com/ZiyuGuo99/Image-Generation-CoT

摘要

近期研究进展突显了强化学习（RL）在增强大语言模型（LLMs）思维链（CoT）推理能力中的重要作用。其中直接偏好优化（DPO）和群体相对策略优化（GRPO）作为两种核心RL算法，展现出各自的优势与局限。自回归图像生成可被解读为一种序列化CoT推理过程，但其面临与LLM-based CoT推理截然不同的独特挑战——包括确保文本-图像一致性、提升图像美学质量、设计复杂奖励模型而非依赖简单规则奖励等。尽管已有研究将RL引入该领域，但这些探索通常缺乏对领域特定挑战及不同RL策略特性的深入分析。为填补这一空白，我们首次对GRPO和DPO算法在自回归图像生成中的表现展开全面研究：评估其领域内性能与跨领域泛化能力，同时剖析不同奖励模型对其效能的影响。研究发现GRPO与DPO各具优势，且关键的是，具备更强内在泛化能力的奖励模型可能提升所应用RL算法的泛化潜力。此外，我们系统探索了三种主流扩展策略以增强其领域内外能力，为每种范式的高效性能扩展提供了独到见解。本研究有望为开发更有效RL算法、实现自回归图像生成领域鲁棒CoT推理的未来工作开辟新路径。代码已发布于https://github.com/ZiyuGuo99/Image-Generation-CoT。
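
For readers unfamiliar with the two algorithms compared here, the sketch below writes out their core quantities in their commonly published forms: the group-relative advantage used by GRPO and the pairwise preference loss used by DPO. It is a minimal illustration with scalar inputs, not the paper's training code. The function names, the use of the population standard deviation, and the default `beta` are assumptions.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Normalize each sample's reward against its own group (mean 0, unit scale)."""
    mean = sum(rewards) / len(rewards)
    variance = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(variance)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """-log(sigmoid(margin)) for one (chosen, rejected) pair of sequence log-probs."""
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return math.log1p(math.exp(-margin))

# Toy numbers: rewards for three images sampled from the same prompt.
advantages = grpo_advantages([1.0, 2.0, 3.0])
neutral = dpo_loss(0.0, 0.0, 0.0, 0.0)
preferred = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

GRPO needs a scalar reward for every sampled image, while DPO only needs to know which image of a pair is preferred. That difference is one reason the choice of reward model, which the abstract highlights, affects the two methods differently.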

FiDeLiS: Faithful Reasoning in Large Language Model for Knowledge Graph Question Answering

Abstract

arXiv:2405.13873v4 Announce Type: replace Abstract: Large Language Models (LLMs) are often challenged by generating erroneous or hallucinated responses, especially in complex reasoning tasks. Leveraging Knowledge Graphs (KGs) as external knowledge sources has emerged as a viable solution. However, existing KG-enhanced methods, either retrieval-based or agent-based, encounter difficulties in accurately retrieving knowledge and efficiently traversing KGs at scale. In this paper, we propose a unified framework, FiDeLiS, designed to improve the factuality of LLM responses by anchoring answers to verifiable reasoning steps retrieved from KGs. To achieve this, we leverage step-wise beam search with a deductive scoring function, allowing the LLM to validate the reasoning process step by step and halt the search once the question is deducible. In addition, we propose a Path-RAG module to pre-select a smaller candidate set for each beam search step, reducing computational costs by narrowing the search space. Extensive experiments show that our method, as a training-free framework, not only improves performance but also enhances factuality and interpretability across different benchmarks. Code is released at https://github.com/Y-Sui/FiDeLiS.

摘要

大型语言模型（LLM）在生成复杂推理任务的响应时，常面临错误或幻觉输出的挑战。利用知识图谱（KG）作为外部知识源已成为可行解决方案，但现有基于检索或智能体的KG增强方法难以实现精准知识检索与大规模图谱高效遍历。本文提出统一框架FiDeLiS，通过将答案锚定至从KG检索的可验证推理步骤，提升LLM响应的真实性。该框架采用基于演绎评分函数的逐步束搜索技术，使LLM能逐步骤验证推理过程，并在问题可推导时终止搜索。此外，我们提出Path-RAG模块为每个束搜索步骤预选小型候选集，通过缩小搜索空间降低计算成本。大量实验表明，该免训练框架不仅提升了性能，还在多个基准测试中增强了输出的真实性与可解释性。代码发布于https://github.com/Y-Sui/FiDeLiS。
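
The step-wise beam search with early halting described above can be sketched on a toy graph. This is not the FiDeLiS implementation. `score_fn` and `is_deducible` are hypothetical stand-ins for the LLM-based deductive scoring and the deducibility check, and the graph is invented for illustration.

```python
def beam_search(kg, start, score_fn, is_deducible, beam_width=2, max_steps=3):
    """Expand relation paths from `start`, keeping the best `beam_width` per step."""
    beams = [[]]  # each beam is a list of (relation, entity) hops from `start`
    for _ in range(max_steps):
        candidates = []
        for path in beams:
            tail = path[-1][1] if path else start
            for relation, neighbor in kg.get(tail, []):
                candidates.append(path + [(relation, neighbor)])
        if not candidates:
            break
        candidates.sort(key=score_fn, reverse=True)
        beams = candidates[:beam_width]
        for path in beams:
            if is_deducible(path):  # halt as soon as the answer can be deduced
                return path
    return None

# Toy knowledge graph: entity -> list of (relation, neighbor).
KG = {"A": [("r1", "B"), ("r2", "C")], "B": [("r3", "D")], "C": [("r4", "E")]}
score = lambda path: sum(1 for rel, _ in path if rel in {"r1", "r3"})
done = lambda path: path[-1][1] == "D"

result = beam_search(KG, "A", score, done, beam_width=1)
print(result)
```

A pre-selection step in the spirit of Path-RAG would filter `kg.get(tail, [])` down to a smaller candidate list before scoring, so that the expensive scoring function is called fewer times.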

Judgment-of-Thought Prompting: A Courtroom-Inspired Framework for Binary Logical Reasoning with Large Language Models

Abstract

arXiv:2409.16635v2 Announce Type: replace Abstract: This paper proposes a novel prompting approach, Judgment of Thought (JoT), specifically tailored for binary logical reasoning tasks. Despite advances in prompt engineering, existing approaches still face limitations in handling complex logical reasoning tasks. To address these issues, JoT introduces a multi-agent approach with three specialized roles (lawyer, prosecutor, and judge), where a high-level model acts as the judge, and lower-level models serve as lawyer and prosecutor to systematically debate and evaluate arguments. Experimental evaluations on benchmarks such as BigBenchHard and Winogrande demonstrate JoT's superior performance compared to existing prompting approaches, achieving notable improvements, including 98% accuracy in Boolean expressions. Also, our ablation studies validate the critical contribution of each role, iterative refinement loops, and feedback mechanisms. Consequently, JoT significantly enhances accuracy, reliability, and consistency in binary reasoning tasks and shows potential for practical applications.

摘要

本文提出了一种针对二元逻辑推理任务的新型提示方法——思维裁决(Judgment of Thought, JoT)。尽管提示工程领域已取得进展，现有方法在处理复杂逻辑推理任务时仍存在局限。为解决这些问题，JoT采用多智能体架构，设置律师、检察官和法官三个专门角色：由高层模型担任法官，底层模型分别作为律师和检察官进行系统性辩论与论证评估。在BigBenchHard和Winogrande等基准测试上的实验表明，JoT相比现有提示方法具有显著优势，其中布尔表达式准确率达到98%。消融研究验证了各角色职责、迭代优化循环及反馈机制的关键作用。研究表明，JoT能显著提升二元推理任务的准确性、可靠性和一致性，具备实际应用潜力。

Bias Amplification: Large Language Models as Increasingly Biased Media

Abstract

arXiv:2410.15234v3 Announce Type: replace Abstract: Model collapse, a phenomenon characterized by performance degradation due to iterative training on synthetic data, has been widely studied. However, its implications for bias amplification, the progressive intensification of pre-existing societal biases in Large Language Models (LLMs), remain significantly underexplored, despite the growing influence of LLMs in shaping online discourse. In this paper, we introduce an open, generational, and long-context benchmark specifically designed to measure political bias amplification in LLMs, leveraging sentence continuation tasks derived from a comprehensive dataset of U.S. political news. Our empirical study using GPT-2 reveals consistent and substantial political bias intensification (e.g., right-leaning amplification) over iterative synthetic training cycles. We evaluate three mitigation strategies, Overfitting, Preservation, and Accumulation, and demonstrate that bias amplification persists independently of model collapse, even when the latter is effectively controlled. Furthermore, we propose a mechanistic analysis approach that identifies neurons correlated with specific phenomena during inference through regression and statistical tests. This analysis uncovers largely distinct neuron populations driving bias amplification and model collapse, underscoring fundamentally different underlying mechanisms. Finally, we supplement our empirical findings with theoretical intuition that explains the separate origins of these phenomena, guiding targeted strategies for bias mitigation.

摘要

模型崩溃是指因迭代训练合成数据导致的性能退化现象，其研究已较为广泛。然而，尽管大语言模型（LLMs）对线上话语的影响日益增强，模型崩溃在偏见放大（即LLMs中既有社会偏见的渐进性强化）方面的作用仍存在显著研究空白。本文提出一个开放、跨代且支持长文本的基准测试，专门用于衡量LLMs中的政治偏见放大效应，该方法基于美国政治新闻数据集构建的句子续写任务展开。通过GPT-2的实证研究，我们发现迭代合成训练周期会导致政治偏见持续显著加剧（例如右倾倾向强化）。我们评估了过拟合、保持和累积三种缓解策略，证明即使有效控制模型崩溃，偏见放大仍会独立存在。进一步，我们提出一种机制分析方法，通过回归和统计检验识别推理过程中与特定现象相关的神经元。该分析揭示了驱动偏见放大和模型崩溃的神经元群体具有显著差异性，表明二者存在根本不同的内在机制。最后，我们通过理论解释补充实证发现，阐明这两种现象的独立起源，从而为针对性偏见缓解策略提供指导。

HyperGraphRAG: Retrieval-Augmented Generation via Hypergraph-Structured Knowledge Representation

Abstract

arXiv:2503.21322v2 Announce Type: replace Abstract: Standard Retrieval-Augmented Generation (RAG) relies on chunk-based retrieval, whereas GraphRAG advances this approach by graph-based knowledge representation. However, existing graph-based RAG approaches are constrained by binary relations, as each edge in an ordinary graph connects only two entities, limiting their ability to represent the n-ary relations (n >= 2) in real-world knowledge. In this work, we propose HyperGraphRAG, a novel hypergraph-based RAG method that represents n-ary relational facts via hyperedges, and consists of knowledge hypergraph construction, retrieval, and generation. Experiments across medicine, agriculture, computer science, and law demonstrate that HyperGraphRAG outperforms both standard RAG and previous graph-based RAG methods in answer accuracy, retrieval efficiency, and generation quality.

摘要

标准检索增强生成（RAG）通常依赖于基于文本块的检索方式，而GraphRAG通过基于图的知识表示方法改进了这一技术。然而，现有基于图的RAG方法受限于二元关系，因为普通图中的每条边仅能连接两个实体，无法有效表示现实知识中的n元关系（n≥2）。本研究提出HyperGraphRAG，这是一种基于超图的新型RAG方法，通过超边表示n元关系事实，包含知识超图构建、检索和生成三个模块。在医学、农业、计算机科学和法律领域的实验表明，HyperGraphRAG在回答准确性、检索效率和生成质量方面均优于标准RAG及以往基于图的RAG方法。
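
To make the binary-edge limitation concrete, the sketch below stores one n-ary fact as a single hyperedge and retrieves hyperedges by entity overlap. It is a minimal illustration, not the HyperGraphRAG pipeline. The `Hyperedge` class, the overlap-based `retrieve` function, and the example facts are assumptions invented for this sketch.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Hyperedge:
    """One n-ary fact: a set of participating entities plus its source text."""
    entities: frozenset
    fact: str

def retrieve(hyperedges, query_entities, top_k=1):
    """Rank hyperedges by how many query entities they contain."""
    query = set(query_entities)
    scored = [(len(edge.entities & query), edge) for edge in hyperedges]
    scored = [item for item in scored if item[0] > 0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [edge for _, edge in scored[:top_k]]

# Toy knowledge base (illustrative facts only).
EDGES = [
    Hyperedge(frozenset({"patient", "hypertension", "drug_a", "10mg"}),
              "Patients with hypertension receive drug_a at 10mg."),
    Hyperedge(frozenset({"drug_b", "headache"}),
              "drug_b may cause headache."),
]

hits = retrieve(EDGES, ["hypertension", "drug_a"])
print(hits[0].fact)
```

Representing the first fact with ordinary edges would require splitting it into several pairwise links, after which the information that all four entities belong to the same statement is no longer explicit.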

Meta-Reasoner: Dynamic Guidance for Optimized Inference-time Reasoning in Large Language Models

Abstract

arXiv:2502.19918v2 Announce Type: replace Abstract: Large Language Models (LLMs) increasingly rely on prolonged reasoning chains to solve complex tasks. However, this trial-and-error approach often leads to high computational overhead and error propagation, where early mistakes can derail subsequent steps. To address these issues, we introduce Meta-Reasoner, a framework that dynamically optimizes inference-time reasoning by enabling LLMs to "think about how to think." Drawing inspiration from human meta-cognition and dual-process theory, Meta-Reasoner operates as a strategic advisor, decoupling high-level guidance from step-by-step generation. It employs contextual multi-armed bandits to iteratively evaluate reasoning progress and select optimal strategies (e.g., backtrack, clarify ambiguity, restart from scratch, or propose alternative approaches), and reallocates computational resources toward the most promising paths. Our evaluations on mathematical reasoning and puzzles highlight the potential of dynamic reasoning chains to overcome inherent challenges in the LLM reasoning process and also show promise in broader applications, offering a scalable and adaptable solution for reasoning-intensive tasks.

摘要

大型语言模型（LLMs）日益依赖冗长的推理链来解决复杂任务。然而，这种试错方法常导致高昂的计算开销和错误传播问题——早期错误可能使后续步骤偏离正轨。为解决这些问题，我们提出元推理框架（Meta-Reasoner），通过使LLMs能够"思考如何思考"来动态优化推理过程。该框架受人类元认知和双过程理论启发，作为战略顾问将高层指导与逐步生成解耦，采用情境多臂老虎机机制迭代评估推理进度，选择最优策略（如回溯、澄清歧义、重启或提出替代方案），并将计算资源重新分配给最具潜力的路径。在数学推理和谜题求解上的实验表明，动态推理链能有效克服LLM推理过程的固有挑战，其可扩展和自适应的特性也为推理密集型任务提供了更广阔的应用前景。
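
The strategy-selection loop can be sketched with a small bandit. The version below is a simplified tabular stand-in, running one UCB1 bandit per discrete context. It is not the contextual bandit used in the paper. The strategy names, the `"stuck"` context label, and the simulated rewards are assumptions for illustration.

```python
import math
from collections import defaultdict

STRATEGIES = ["backtrack", "clarify", "restart", "alternative"]

class TabularUCB:
    """One UCB1 bandit per discrete context (hypothetical progress labels)."""

    def __init__(self, arms):
        self.arms = list(arms)
        self.counts = defaultdict(lambda: {a: 0 for a in self.arms})
        self.totals = defaultdict(lambda: {a: 0.0 for a in self.arms})

    def select(self, context):
        counts = self.counts[context]
        for arm in self.arms:  # try every strategy once before comparing
            if counts[arm] == 0:
                return arm
        n = sum(counts.values())

        def ucb(arm):
            mean = self.totals[context][arm] / counts[arm]
            return mean + math.sqrt(2 * math.log(n) / counts[arm])

        return max(self.arms, key=ucb)

    def update(self, context, arm, reward):
        self.counts[context][arm] += 1
        self.totals[context][arm] += reward

# Simulated feedback: in the "stuck" context only backtracking helps.
bandit = TabularUCB(STRATEGIES)
for _ in range(40):
    arm = bandit.select("stuck")
    bandit.update("stuck", arm, 1.0 if arm == "backtrack" else 0.0)
pulls = bandit.counts["stuck"]
print(pulls)
```

In a real system the reward would come from an estimate of reasoning progress rather than a fixed rule, and the context would be a feature vector rather than a single label.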

To Code or not to Code? Adaptive Tool Integration for Math Language Models via Expectation-Maximization

Abstract

arXiv:2502.00691v3 Announce Type: replace Abstract: Recent advances in mathematical problem-solving with language models (LMs) integrate chain-of-thought (CoT) reasoning and code execution to harness their complementary strengths. However, existing hybrid frameworks exhibit a critical limitation: they depend on externally dictated instructions or rigid code-integration templates, lacking metacognitive awareness -- the capacity to dynamically evaluate intrinsic capabilities and autonomously determine when and how to integrate tools. This rigidity motivates our study of autonomous code integration, enabling models to adapt tool-usage strategies as their reasoning abilities evolve during training. While reinforcement learning (RL) shows promise for boosting LLM reasoning at scale (e.g., DeepSeek-R1), we demonstrate its inefficiency in learning autonomous code integration due to inadequate exploration of the vast combinatorial space of CoT-code interleaving patterns. To address this challenge, we propose a novel Expectation-Maximization (EM) framework that synergizes structured exploration (E-step) with off-policy RL optimization (M-step), creating a self-reinforcing cycle between metacognitive tool-use decisions and evolving capabilities. Experiments reveal our method achieves superior results through improved exploration. Notably, our 7B model improves over 11% on MATH500 and 9.4% on AIME without o1-like CoT.

摘要

近期在语言模型（LMs）数学问题求解方面的进展，通过整合思维链（CoT）推理与代码执行，以利用二者的互补优势。然而，现有混合框架存在一个关键局限：它们依赖外部指令或僵化的代码集成模板，缺乏元认知意识——即动态评估内在能力并自主决定何时及如何整合工具的能力。这种僵化性促使我们研究自主代码集成，使模型能够随训练中推理能力的演进而自适应调整工具使用策略。尽管强化学习（RL）在大规模提升LLM推理能力方面展现出潜力（如DeepSeek-R1），我们发现其在学习自主代码集成时效率低下，原因在于对CoT与代码交织模式的庞大组合空间探索不足。为解决这一挑战，我们提出一种新颖的期望最大化（EM）框架，将结构化探索（E步）与离策略RL优化（M步）协同结合，在元认知工具使用决策与演进能力间形成自我强化循环。实验表明，该方法通过改进探索实现了更优结果。值得注意的是，我们的7B模型在MATH500上提升超过11%，在AIME上提升9.4%，且无需类o1的CoT。

QLLM: Do We Really Need a Mixing Network for Credit Assignment in Multi-Agent Reinforcement Learning?

Abstract

arXiv:2504.12961v2 Announce Type: replace Abstract: Credit assignment has remained a fundamental challenge in multi-agent reinforcement learning (MARL). Previous studies have primarily addressed this issue through value decomposition methods under the centralized training with decentralized execution paradigm, where neural networks are utilized to approximate the nonlinear relationship between individual Q-values and the global Q-value. Although these approaches have achieved considerable success in various benchmark tasks, they still suffer from several limitations, including imprecise attribution of contributions, limited interpretability, and poor scalability in high-dimensional state spaces. To address these challenges, we propose a novel algorithm, QLLM, which facilitates the automatic construction of credit assignment functions using large language models (LLMs). Specifically, the concept of TFCAF is introduced, wherein the credit allocation process is represented as a direct and expressive nonlinear functional formulation. A custom-designed coder-evaluator framework is further employed to guide the generation, verification, and refinement of executable code by LLMs, significantly mitigating issues such as hallucination and shallow reasoning during inference. Extensive experiments conducted on several standard MARL benchmarks demonstrate that the proposed method consistently outperforms existing state-of-the-art baselines. Moreover, QLLM exhibits strong generalization capability and maintains compatibility with a wide range of MARL algorithms that utilize mixing networks, positioning it as a promising and versatile solution for complex multi-agent scenarios.

摘要

信用分配一直是多智能体强化学习（MARL）领域的核心挑战。先前研究主要通过集中训练分散执行范式下的价值分解方法应对该问题，其利用神经网络逼近个体Q值与全局Q值间的非线性关系。尽管这些方法在多种基准任务中取得了显著成果，但仍存在贡献归因不精确、可解释性有限以及高维状态空间扩展性不足等缺陷。针对上述挑战，本文提出创新算法QLLM，通过大型语言模型（LLMs）实现信用分配函数的自动构建。具体而言，我们引入TFCAF概念，将信用分配过程表述为直接且具表达力的非线性函数形式。进一步采用定制化_编码器-评估器_框架引导LLMs生成、验证与优化可执行代码，显著缓解推理过程中的幻觉与浅层推理等问题。在多个标准MARL基准上的大量实验表明，该方法持续优于现有最先进基线模型。此外，QLLM展现出强大的泛化能力，并与各类采用混合网络的MARL算法保持兼容，有望成为复杂多智能体场景中通用性强的解决方案。

How Well Can a Long Sequence Model Model Long Sequences? Comparing Architectural Inductive Biases on Long-Context Abilities

Abstract

arXiv:2407.08112v3 Announce Type: replace-cross Abstract: Long sequences occur in abundance within real-world scenarios, hence properly modelling them opens numerous downstream use-cases. Deep neural networks, however, have often struggled with these for a variety of reasons. Recent advances, both in system engineering as well as model design, have enabled the scaling up of models that are purported to support extended context length. In particular, the state-space and linear recurrent neural network families of models can hypothetically extend to infinite sequence length. However, is this too good to be true? We conduct an evaluation to show that while such claims may be sound theoretically, there remain large practical gaps that are empirically observed. In particular, recurrent models still suffer in the same settings as long-context LLMs with attention. We further show that different inductive biases have inconsistent extrapolation capabilities, highlighting the need to further study such paradigms and investigate why long-context models seemingly fail to behave as one might expect.

Worse than Zero-shot? A Fact-Checking Dataset for Evaluating the Robustness of RAG Against Misleading Retrievals

Abstract

arXiv:2502.16101v2 Announce Type: replace Abstract: Retrieval-augmented generation (RAG) has shown impressive capabilities in mitigating hallucinations in large language models (LLMs). However, LLMs struggle to handle misleading retrievals and often fail to maintain their own reasoning when exposed to conflicting or selectively-framed evidence, making them vulnerable to real-world misinformation. In such real-world retrieval scenarios, misleading and conflicting information is rampant, particularly in the political domain, where evidence is often selectively framed, incomplete, or polarized. However, existing RAG benchmarks largely assume a clean retrieval setting, where models succeed by accurately retrieving and generating answers from gold-standard documents. This assumption fails to align with real-world conditions, leading to an overestimation of RAG system performance. To bridge this gap, we introduce RAGuard, a fact-checking dataset designed to evaluate the robustness of RAG systems against misleading retrievals. Unlike prior benchmarks that rely on synthetic noise, our dataset constructs its retrieval corpus from Reddit discussions, capturing naturally occurring misinformation. It categorizes retrieved evidence into three types: supporting, misleading, and irrelevant, providing a realistic and challenging testbed for assessing how well RAG systems navigate different retrieval information. Our benchmark experiments reveal that when exposed to misleading retrievals, all tested LLM-powered RAG systems perform worse than their zero-shot baselines (i.e., no retrieval at all), highlighting their susceptibility to noisy environments. To the best of our knowledge, RAGuard is the first benchmark to systematically assess RAG robustness against misleading evidence. We expect this benchmark will drive future research toward improving RAG systems beyond idealized datasets, making them more reliable for real-world applications.

摘要

检索增强生成（RAG）技术在缓解大语言模型（LLM）幻觉方面展现出卓越能力。然而，当面临误导性检索或冲突性、选择性呈现的证据时，LLM往往难以处理误导信息并维持自身推理逻辑，使其在现实世界错误信息面前表现脆弱。尤其在政治领域，检索场景中普遍存在被选择性框定、不完整或极端化的误导与冲突信息。现有RAG基准测试大多基于理想化检索环境，假设模型通过准确检索黄金标准文档生成答案，这种设定与真实条件脱节，导致对RAG系统性能的高估。为弥补这一差距，我们提出RAGuard——专用于评估RAG系统抵御误导性检索鲁棒性的事实核查数据集。不同于依赖合成噪声的既有基准，本数据集从Reddit讨论中构建检索语料库，捕捉自然发生的错误信息。其将检索证据分为支持性、误导性和无关性三类，为评估RAG系统处理不同检索信息的能力提供真实且具有挑战性的测试平台。基准实验表明，当接触误导性检索时，所有测试的LLM驱动RAG系统表现均逊于零样本基线（即完全不检索），凸显其对噪声环境的敏感性。据我们所知，RAGuard是首个系统评估RAG对抗误导证据鲁棒性的基准。我们期待该基准能推动未来研究超越理想化数据集，提升RAG系统在现实应用中的可靠性。

DynaServe: Unified and Elastic Execution for Dynamic Disaggregated LLM Serving

Abstract

arXiv:2504.09285v2 Announce Type: replace Abstract: LLM inference must meet strict latency SLOs (e.g., 100 ms P99 time-between-tokens) while maximizing goodput. Yet, real-world variability in prompt and response lengths skews compute-intensive prefill and memory-bound decode phases, making both colocated (even with chunked prefill) and disaggregated deployments unable to simultaneously deliver low tail latency and high throughput. We introduce DynaServe, a high-performance LLM serving system built atop vLLM that unifies and extends both paradigms for maximizing goodput under SLO constraints, when handling unbalanced and dynamic workloads. It relies on a micro-request abstraction, which arbitrarily splits each request at any token boundary into at most two cooperating segments. A two-level scheduling framework then balances micro-request load across unified GPU instances. The global scheduler rapidly selects per-request split points by considering both the request's prefill/decode time ratio and the current load across GPU instances. The local schedulers on each GPU instance independently form SLO-aware batches, adjusting their composition in response to workload fluctuations, potential latency spikes and per-GPU under/over utilization. On real-world traces, DynaServe boosts the overall serving capacity from 1.15× to 3.07×, improves goodput by up to 1.91× and 1.61×, and improves the performance by up to 60% in a hybrid workload under SLO compared to state-of-the-art colocated and disaggregated baselines.

摘要

大型语言模型（LLM）推理必须在满足严格延迟服务等级目标（SLO）（例如100毫秒P99 token间延迟）的同时最大化优质吞吐量。然而，实际应用中提示词和响应长度的不均衡性导致计算密集型的前填充阶段与内存受限的解码阶段资源需求失衡，使得无论是采用共置部署（即使采用分块前填充）还是分离式部署，均无法同时实现低尾延迟与高吞吐量。我们提出DynaServe——一个基于vLLM构建的高性能LLM服务系统，通过统一并扩展两种范式，在处理不均衡动态工作负载时实现SLO约束下的优质吞吐量最大化。该系统采用微请求抽象机制，可在任意token边界将每个请求动态拆分为最多两个协同执行的片段。随后，双层调度框架在统一GPU实例间实现微请求负载均衡：全局调度器通过综合分析请求的前填充/解码时间比及各GPU实例的实时负载，快速确定每个请求的最佳分割点；各GPU实例的本地调度器则自主形成SLO感知的批处理组合，根据工作负载波动、潜在延迟峰值及单GPU利用率不足/过载情况进行动态调整。在实际场景测试中，DynaServe将总体服务容量提升1.15倍至3.07倍，优质吞吐量最高提升1.91倍和1.61倍，在SLO约束下的混合工作负载中性能较现有最优共置与分离式基线提升达60%。

Large Language Models are Miscalibrated In-Context Learners

Abstract

arXiv:2312.13772v3 Announce Type: replace-cross Abstract: When adapting ICL with or without fine-tuning, we are curious about whether the instruction-tuned language model is able to achieve well-calibrated results without suffering from the problem of overconfidence (i.e., miscalibration) considering its strong instruction following ability, especially in such limited data setups. In this work, we deliver an in-depth analysis of the behavior across different choices of learning methods from the perspective of both performance and calibration. Through extensive controlled experiments, we observe that the miscalibration problem exists across all learning methods in low-resource setups. To achieve simultaneous gain for both in-task performance and calibration, we then study the potential of self-ensembling applied at different modeling stages (e.g., variations of in-context examples or variations in prompts or different ensembling strategies) to make the predictions more calibrated and have comparable or even better performance. We find that self-ensembling with max probability produces robust and calibrated predictions. Our work reveals the potential calibration problem of using ICL despite the improvements in task performance and sheds light on which learning paradigm to choose. We also provide practical guidelines for choosing learning paradigms depending on whether the data has been seen by the model before and a worthwhile solution via self-ensembling on how to enhance both task performance and calibration of LMs, which we hope could encourage further study.

摘要

在使用或不使用微调的情况下进行上下文学习（ICL）适配时，我们关注的是考虑到指令调优语言模型强大的指令遵循能力，其是否能在有限数据设置下获得良好校准的结果，而不受过度自信（即错误校准）问题的影响。本研究从性能和校准两个角度，对不同学习方法的选择行为进行了深入分析。通过大量对照实验，我们观察到在低资源设置下，所有学习方法均存在错误校准问题。为实现任务内性能与校准水平的同步提升，我们进一步研究了在不同建模阶段（如上下文示例的变体、提示的变体或不同集成策略）应用自集成技术的潜力，以使预测结果更趋校准且具有可比性甚至更优性能。研究发现，采用最大概率的自集成方法能产生稳健且校准良好的预测。本工作揭示了尽管ICL能提升任务性能，但仍存在潜在校准问题，并为学习范式的选择提供了依据。我们还针对模型是否曾接触过数据的情况，提出了选择学习范式的实用指南，并通过自集成技术提供了提升语言模型任务性能与校准水平的可行方案，以期推动后续研究。
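
Two quantities in this abstract lend themselves to a short sketch: a calibration metric and an ensembling rule. The code below computes the standard expected calibration error (ECE) with equal-width bins, and shows one plausible reading of "self-ensembling with max probability", taking the per-class maximum across ensemble members and renormalizing. The exact rule used in the paper may differ. Function names and toy numbers are assumptions.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between accuracy and confidence over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

def max_probability_ensemble(member_probs):
    """Per-class maximum across members, renormalized (an assumed reading)."""
    n_classes = len(member_probs[0])
    peak = [max(p[k] for p in member_probs) for k in range(n_classes)]
    norm = sum(peak)
    return [v / norm for v in peak]

# Toy data: four predictions and two ensemble members (e.g. two prompt variants).
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
fused = max_probability_ensemble([[0.6, 0.4], [0.2, 0.8]])
print(ece, fused)
```

A lower ECE means that stated confidence tracks accuracy more closely. An overconfident model shows high confidence in bins where accuracy is much lower.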

CodeMind: Evaluating Large Language Models for Code Reasoning

Abstract

arXiv:2402.09664v5 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have been widely used to automate programming tasks. Their capabilities have been evaluated by assessing the quality of generated code through tests or proofs. The extent to which they can reason about code is a critical question revealing important insights about their true capabilities. This paper introduces CodeMind, a framework designed to gauge the code reasoning abilities of LLMs through the following explicit and implicit code reasoning tasks: Independent Execution Reasoning (IER), Specification Reasoning (SR) and Dynamic Semantics Reasoning (DSR). The first evaluates the abilities of LLMs to simulate the execution of given inputs to a code and predict the output (IER). The second assesses the abilities of LLMs to incorporate the simulation of test data in the specification into code generation (SR). Finally, CodeMind evaluates LLMs' abilities to understand overall code semantics only given a specific input/output (DSR). Our extensive evaluation of ten LLMs across four widely used benchmarks using CodeMind shows that LLMs, depending on their size and training strategy, can reason about some dynamic aspects of code. However, their performance drops for code with higher complexity, non-trivial logical and arithmetic operators, non-primitive types, and API calls. We show that these reasoning tasks evaluate LLMs differently, and a comprehensive evaluation of code reasoning requires them all. Finally, we show that the performance of LLMs in bug repair is not correlated with any of the code reasoning tasks, and except for advanced frontier models, other LLMs do not incorporate code reasoning when performing bug repair.

摘要

大型语言模型（LLMs）已被广泛应用于编程任务自动化。现有研究主要通过测试或验证生成代码的质量来评估其能力。而模型对代码的推理能力程度，则是揭示其真实性能的关键问题。本文提出CodeMind框架，通过以下显式与隐式代码推理任务来量化LLMs的代码推理能力：独立执行推理（IER）、规范推理（SR）和动态语义推理（DSR）。IER评估模型对给定代码输入执行模拟并预测输出的能力；SR测试模型将规范中测试数据的模拟融入代码生成的能力；DSR则衡量模型仅凭特定输入/输出理解整体代码语义的能力。我们使用CodeMind对十个LLMs在四个常用基准上的评估表明：根据模型规模与训练策略的不同，LLMs能对代码的某些动态特性进行推理，但在处理高复杂度代码、非平凡逻辑/算术运算符、非原始类型及API调用时性能显著下降。这些推理任务对LLMs的评估维度各异，全面评估代码推理能力需涵盖所有任务。最后我们发现，LLMs在缺陷修复任务中的表现与任何代码推理任务均无相关性，且除前沿高级模型外，其他LLMs在执行缺陷修复时并未体现代码推理能力。

Permissive Information-Flow Analysis for Large Language Models

Abstract

arXiv:2410.03055v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) are rapidly becoming commodity components of larger software systems. This poses natural security and privacy problems: poisoned data retrieved from one component can change the model's behavior and compromise the entire system, including coercing the model to spread confidential data to untrusted components. One promising approach is to tackle this problem at the system level via dynamic information flow (aka taint) tracking. Unfortunately, this approach of propagating the most restrictive input label to the output is too conservative for applications where LLMs operate on inputs retrieved from diverse sources. In this paper, we propose a novel, more permissive approach to propagate information flow labels through LLM queries. The key idea behind our approach is to propagate only the labels of the samples that were influential in generating the model output and to eliminate the labels of unnecessary inputs. We implement and investigate the effectiveness of two variations of this approach, based on (i) prompt-based retrieval augmentation, and (ii) a k-nearest-neighbors language model. We compare these with a baseline that uses introspection to predict the output label. Our experimental results in an LLM agent setting show that the permissive label propagator improves over the baseline in more than 85% of the cases, which underscores the practicality of our approach.

摘要

大型语言模型（LLMs）正迅速成为大型软件系统的标准化组件。这带来了天然的安全与隐私问题：从某一组件获取的污染数据可能改变模型行为并危及整个系统，包括迫使模型将机密数据泄露至不可信组件。动态信息流（即污点）追踪作为系统级解决方案展现出潜力，但其将最严格输入标签传播至输出的方式对多源输入的LLM应用而言过于保守。本文提出一种新颖的、更具许可性的信息流标签传播方法。其核心思想是仅传播对模型输出具有实际影响的样本标签，同时消除无关输入的标签。我们基于（i）提示驱动的检索增强和（ii）k近邻语言模型实现了两种变体，并与基于自省预测输出标签的基线方法进行对比。LLM代理环境下的实验表明，在超过85%的案例中，许可性标签传播器的表现优于基线，验证了该方法的实用性。
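
The difference between conservative and permissive label propagation can be shown on a three-level confidentiality lattice. The sketch below is illustrative only. The level names, the per-document `influence` scores, and the threshold are hypothetical, and the paper determines influence with its own two mechanisms rather than a supplied score list.

```python
LEVELS = {"public": 0, "internal": 1, "secret": 2}

def join(labels):
    """Least upper bound of the labels: the most restrictive one present."""
    return max(labels, key=LEVELS.__getitem__, default="public")

def conservative_label(documents):
    """Classic taint tracking: the output inherits every input's label."""
    return join(doc["label"] for doc in documents)

def permissive_label(documents, influence, threshold=0.5):
    """Only inputs judged influential for the output contribute their label."""
    influential = [doc for doc, score in zip(documents, influence)
                   if score >= threshold]
    return join(doc["label"] for doc in influential)

# Toy retrieval result: the secret document barely affected the answer.
DOCS = [
    {"text": "Office hours are 9 to 5.", "label": "public"},
    {"text": "Quarterly revenue draft.", "label": "secret"},
]
INFLUENCE = [0.9, 0.1]  # hypothetical influence scores

strict = conservative_label(DOCS)
relaxed = permissive_label(DOCS, INFLUENCE)
print(strict, relaxed)
```

With the conservative rule the answer is labeled `secret` and cannot be shown to an untrusted component, even though the secret document contributed almost nothing. The permissive rule labels the same answer `public`.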

More Text, Less Point: Towards 3D Data-Efficient Point-Language Understanding

Abstract

arXiv:2408.15966v3 Announce Type: replace-cross Abstract: Enabling Large Language Models (LLMs) to comprehend the 3D physical world remains a significant challenge. Due to the lack of large-scale 3D-text pair datasets, the success of LLMs has yet to be replicated in 3D understanding. In this paper, we rethink this issue and propose a new task: 3D Data-Efficient Point-Language Understanding. The goal is to enable LLMs to achieve robust 3D object understanding with minimal 3D point cloud and text data pairs. To address this task, we introduce GreenPLM, which leverages more text data to compensate for the lack of 3D data. First, inspired by using CLIP to align images and text, we utilize a pre-trained point cloud-text encoder to map the 3D point cloud space to the text space. This mapping lets us seamlessly connect the text space with LLMs. Once the point-text-LLM connection is established, we further enhance text-LLM alignment by expanding the intermediate text space, thereby reducing the reliance on 3D point cloud data. Specifically, we generate 6M free-text descriptions of 3D objects, and design a three-stage training strategy to help LLMs better explore the intrinsic connections between different modalities. To achieve efficient modality alignment, we design a zero-parameter cross-attention module for token pooling. Extensive experimental results show that GreenPLM requires only 12% of the 3D training data used by existing state-of-the-art models to achieve superior 3D understanding. Remarkably, GreenPLM also achieves competitive performance using text-only data. The code and weights are available at: https://github.com/TangYuan96/GreenPLM.

摘要

让大语言模型（LLMs）理解三维物理世界仍是一个重大挑战。由于缺乏大规模的三维-文本配对数据集，LLMs的成功尚未在三维理解领域得到复现。本文重新审视这一问题，提出了一项新任务：三维数据高效的点云-语言理解，其目标是使LLMs能够利用极少的三维点云与文本数据对实现稳健的三维物体理解。针对该任务，我们提出了GreenPLM框架，通过利用更多文本来弥补三维数据的不足。首先，受CLIP对齐图像与文本的启发，我们采用预训练的点云-文本编码器将三维点云空间映射到文本空间，从而无缝连接文本空间与LLMs。建立点云-文本-LLM的连接后，我们通过扩展中间文本空间进一步增强文本-LLM对齐，从而降低对三维点云数据的依赖。具体而言，我们生成了600万条三维物体的自由文本描述，并设计了三阶段训练策略以帮助LLMs探索不同模态间的内在联系。为实现高效的模态对齐，我们设计了一个零参数交叉注意力模块用于令牌池化。大量实验结果表明，GreenPLM仅需现有最优模型12%的三维训练数据即可实现更优的三维理解能力。值得注意的是，GreenPLM仅使用文本数据也能取得具有竞争力的性能。代码与权重已开源：https://github.com/TangYuan96/GreenPLM。

Discovering Spoofing Attempts on Language Model Watermarks

Abstract

arXiv:2410.02693v2 Announce Type: replace-cross Abstract: LLM watermarks stand out as a promising way to attribute ownership of LLM-generated text. One threat to watermark credibility comes from spoofing attacks, where an unauthorized third party forges the watermark, enabling it to falsely attribute arbitrary texts to a particular LLM. Despite recent work demonstrating that state-of-the-art schemes are, in fact, vulnerable to spoofing, no prior work has focused on post-hoc methods to discover spoofing attempts. In this work, we propose, for the first time, a reliable statistical method to distinguish spoofed from genuinely watermarked text, suggesting that current spoofing attacks are less effective than previously thought. In particular, we show that regardless of their underlying approach, all current learning-based spoofing methods consistently leave observable artifacts in spoofed texts, indicative of watermark forgery. We build upon these findings to propose rigorous statistical tests that reliably reveal the presence of such artifacts and thus demonstrate that a watermark has been spoofed. Our experimental evaluation shows high test power across all learning-based spoofing methods, providing insights into their fundamental limitations and suggesting a way to mitigate this threat. We make all our code available at https://github.com/eth-sri/watermark-spoofing-detection.

摘要

大语言模型（LLM）水印作为一种有前景的技术，可用于确认LLM生成文本的所有权。然而，水印可信度面临的一大威胁来自伪造攻击，即未经授权的第三方伪造水印，从而将任意文本虚假归因于特定LLM。尽管近期研究表明现有最先进的水印方案实际上易受伪造攻击，但此前尚未有研究关注于事后检测伪造尝试的方法。本研究首次提出了一种可靠的统计方法，用于区分伪造水印文本与真实水印文本，表明当前伪造攻击的实际效果低于此前预期。具体而言，我们发现无论采用何种底层方法，当前所有基于学习的伪造方法都会在伪造文本中留下可观测的痕迹，这些痕迹可作为水印伪造的证据。基于这些发现，我们提出了严格的统计检验方法，能够可靠地揭示此类痕迹的存在，从而证明水印已被伪造。实验评估表明，我们的方法对所有基于学习的伪造攻击均具有高检测效力，这不仅揭示了这些方法的根本局限性，也为缓解此类威胁提供了可行路径。相关代码已开源：https://github.com/eth-sri/watermark-spoofing-detection。

Domain-Oriented Time Series Inference Agents for Reasoning and Automated Analysis

Abstract

arXiv:2410.04047v4 Announce Type: replace-cross Abstract: Real-world time series inference requires more than point forecasting. It demands multi-step reasoning, constraint handling, domain knowledge incorporation, and domain-specific workflow assembly. Existing time series foundation models are limited to narrow tasks and lack flexibility to generalize across diverse scenarios. On the other hand, large language models (LLMs) struggle with numerical precision. To address these limitations, we introduce TS-Reasoner, a Domain-Oriented Time Series Agent that integrates natural language reasoning with precise numerical execution. TS-Reasoner decomposes natural language instructions into structured workflows composed of statistical, logical, and domain-specific operators, and incorporates a self-refinement mechanism for adaptive execution. We evaluate its capabilities through two axes: basic time series understanding and complex multi-step inference, using the TimeSeriesExam benchmark and a newly constructed dataset. Experimental results show that TS-Reasoner significantly outperforms general-purpose LLMs, highlighting the promise of domain-specialized agents for robust and interpretable time series reasoning.

摘要

现实世界的时间序列推理需求远超点预测范畴，需要具备多步推理、约束处理、领域知识整合及特定领域工作流组装能力。现有时间序列基础模型仅局限于狭窄任务范畴，缺乏跨场景泛化的灵活性；而大型语言模型（LLMs）则存在数值精度不足的问题。为突破这些限制，我们提出TS-Reasoner——一个融合自然语言推理与精确数值执行的面向领域时间序列智能体。该系统将自然语言指令分解为由统计、逻辑及领域专用算子构成的结构化工作流，并引入自适应执行的自优化机制。我们通过TimeSeriesExam基准和新构建的数据集，从基础时序理解与复杂多步推理两个维度评估其性能。实验结果表明TS-Reasoner显著优于通用LLMs，彰显了领域专用智能体在构建鲁棒可解释时序推理系统方面的潜力。

Do Robot Snakes Dream like Electric Sheep? Investigating the Effects of Architectural Inductive Biases on Hallucination

Abstract

arXiv:2410.17477v5 Announce Type: replace-cross Abstract: The growth in prominence of large language models (LLMs) in everyday life can be largely attributed to their generative abilities, yet some of this is also owed to the risks and costs associated with their use. On one front is their tendency to hallucinate false or misleading information, limiting their reliability. On another is the increasing focus on the computational limitations associated with traditional self-attention based LLMs, which has brought about new alternatives, in particular recurrent models, meant to overcome them. Yet it remains uncommon to consider these two concerns simultaneously. Do changes in architecture exacerbate or alleviate existing concerns about hallucinations? Do they affect how and where they occur? Through an extensive evaluation, we study how these architecture-based inductive biases affect the propensity to hallucinate. While hallucination remains a general phenomenon not limited to specific architectures, the situations in which hallucinations occur and the ease with which specific types of hallucinations can be induced can differ significantly based on the model architecture. These findings highlight the need to better understand both problems in conjunction with each other, and to consider how to design more universal techniques for handling hallucinations.

摘要

大型语言模型（LLMs）在日常生活中的显著增长很大程度上归功于其生成能力，但部分原因也与其使用风险和成本相关。一方面，这些模型倾向于产生虚假或误导性信息的幻觉，限制了其可靠性。另一方面，人们越来越关注传统基于自注意力机制的LLMs所面临的计算局限性，这催生了新的替代方案（特别是循环模型）以克服这些限制。然而，同时考虑这两个问题的研究仍不多见：架构变化是否会加剧/缓解现有的幻觉问题？它们如何影响幻觉的发生方式和场景？通过广泛评估，我们研究了基于架构的归纳偏差如何影响幻觉倾向。虽然幻觉仍是普遍现象而非特定架构所独有，但其发生情境以及诱发特定类型幻觉的难易程度会因模型架构而显著不同。这些发现强调需要更好地协同理解这两个问题，并考虑如何设计更通用的技术来处理幻觉现象。

Graph-based Confidence Calibration for Large Language Models

Abstract

arXiv:2411.02454v2 Announce Type: replace-cross Abstract: Reliable confidence estimation is essential for enhancing the trustworthiness of large language models (LLMs), especially in high-stakes scenarios. Despite its importance, accurately estimating confidence in LLM responses remains a significant challenge. In this work, we propose using an auxiliary learning model to assess response correctness based on the self-consistency of multiple outputs generated by the LLM. Our method builds a consistency graph to represent the agreement among multiple responses and uses a graph neural network (GNN) to estimate the likelihood that each response is correct. Experiments demonstrate that this method has strong calibration performance on various benchmark datasets and generalizes well to out-of-domain cases.

摘要

可靠置信度估计对于增强大语言模型（LLMs）的可信度至关重要，尤其是在高风险场景中。尽管其重要性显著，但准确评估LLM响应的置信度仍存在重大挑战。本研究提出采用辅助学习模型，基于LLM生成多输出的自一致性来评估响应正确性。我们的方法构建一致性图以表征多响应间的共识，并利用图神经网络（GNN）估算每个响应正确的概率。实验表明，该方法在多个基准数据集上具有强校准性能，并能良好泛化至域外案例。
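
The consistency-graph idea above can be illustrated with a small sketch. This is not the authors' implementation: the similarity measure (token-level Jaccard overlap) is an assumption chosen for simplicity, and the mean edge weight per node is only a naive stand-in for the learned GNN that the abstract describes. The function names `jaccard`, `consistency_graph`, and `agreement_scores` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity (an assumed agreement measure)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def consistency_graph(responses):
    """Build a dense weighted adjacency matrix over sampled responses."""
    n = len(responses)
    adj = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        w = jaccard(responses[i], responses[j])
        adj[i][j] = adj[j][i] = w
    return adj


def agreement_scores(adj):
    """Mean edge weight per node: a naive placeholder for the GNN output."""
    n = len(adj)
    if n < 2:
        return [0.0] * n
    return [sum(row) / (n - 1) for row in adj]


if __name__ == "__main__":
    samples = ["paris", "Paris", "london"]
    graph = consistency_graph(samples)
    print(agreement_scores(graph))
```

In this toy setting, a response that agrees with most other samples receives a higher score; a trained GNN would replace `agreement_scores` and produce calibrated probabilities rather than raw agreement.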

Model-based Large Language Model Customization as Service

Abstract

arXiv:2410.10481v3 Announce Type: replace-cross Abstract: Prominent Large Language Model (LLM) services from providers like OpenAI and Google excel at general tasks but often underperform on domain-specific applications. Current customization services for these LLMs typically require users to upload data for fine-tuning, posing significant privacy risks. While differentially private (DP) data synthesis presents a potential alternative, its application commonly results in low effectiveness due to the introduction of excessive noise on data for DP. To overcome this, we introduce Llamdex, a novel framework that facilitates LLM customization as a service, where the client uploads pre-trained domain-specific models rather than data. This client-uploaded model, optionally protected by DP with much lower noise, is inserted into the base LLM via connection modules. Significantly, these connection modules are trained without requiring sensitive domain data, enabling clients to customize LLM services while preserving data privacy. Experiments demonstrate that Llamdex improves domain-specific accuracy by up to 26% over state-of-the-art private data synthesis methods under identical privacy constraints and, by obviating the need for users to provide domain context within queries, maintains inference efficiency comparable to the original LLM service.

摘要

OpenAI和谷歌等提供商推出的知名大语言模型（LLM）服务在通用任务上表现优异，但在特定领域应用中往往表现欠佳。当前针对这些LLM的定制化服务通常要求用户上传数据进行微调，这带来了显著的隐私风险。虽然差分隐私（DP）数据合成提供了一种潜在替代方案，但由于需为DP保护引入过多数据噪声，其应用效果普遍较差。为此，我们提出了Llamdex这一新型框架，将LLM定制转化为服务模式——客户端上传预训练的领域专用模型而非原始数据。这些客户端上传的模型可选择采用噪声量大幅降低的DP保护，并通过连接模块嵌入基础LLM中。值得注意的是，这些连接模块的训练无需敏感领域数据，使得客户能在保护数据隐私的同时定制LLM服务。实验表明，在相同隐私约束条件下，Llamdex相比最先进的私有数据合成方法将领域准确率最高提升26%，且由于用户无需在查询中提供领域上下文，其推理效率与原始LLM服务相当。

Long-Form Text-to-Music Generation with Adaptive Prompts: A Case Study in Tabletop Role-Playing Games Soundtracks

Abstract

arXiv:2411.03948v3 Announce Type: replace-cross Abstract: This paper investigates the capabilities of text-to-audio music generation models in producing long-form music with prompts that change over time, focusing on soundtrack generation for Tabletop Role-Playing Games (TRPGs). We introduce Babel Bardo, a system that uses Large Language Models (LLMs) to transform speech transcriptions into music descriptions for controlling a text-to-music model. Four versions of Babel Bardo were compared in two TRPG campaigns: a baseline using direct speech transcriptions, and three LLM-based versions with varying approaches to music description generation. Evaluations considered audio quality, story alignment, and transition smoothness. Results indicate that detailed music descriptions improve audio quality, while maintaining consistency across consecutive descriptions enhances story alignment and transition smoothness.

摘要

本文研究了文本到音频音乐生成模型在随时间变化提示下生成长篇音乐的能力，重点关注桌面角色扮演游戏(TRPG)配乐生成。我们介绍了Babel Bardo系统，该系统利用大型语言模型(LLM)将语音转录文本转化为音乐描述，用以控制文本到音乐模型。研究在两次TRPG战役中比较了四个版本的Babel Bardo：使用直接语音转录的基线版本，以及三种采用不同音乐描述生成方法的LLM版本。评估指标包括音频质量、故事契合度和过渡平滑性。结果表明，详细的音乐描述能提高音频质量，而保持连续描述间的一致性则可增强故事契合度和过渡平滑性。

Can Knowledge Graphs Make Large Language Models More Trustworthy? An Empirical Study Over Open-ended Question Answering

Abstract

arXiv:2410.08085v4 Announce Type: replace-cross Abstract: Recent works integrating Knowledge Graphs (KGs) have shown promising improvements in enhancing the reasoning capabilities of Large Language Models (LLMs). However, existing benchmarks primarily focus on closed-ended tasks, leaving a gap in evaluating performance on more complex, real-world scenarios. This limitation also hinders a thorough assessment of KGs' potential to reduce hallucinations in LLMs. To address this, we introduce OKGQA, a new benchmark specifically designed to evaluate LLMs augmented with KGs in open-ended, real-world question answering settings. OKGQA reflects practical complexities through diverse question types and incorporates metrics to quantify both hallucination rates and reasoning improvements in LLM+KG models. To consider the scenarios in which KGs may contain varying levels of errors, we propose a benchmark variant, OKGQA-P, to assess model performance when the semantics and structure of KGs are deliberately perturbed and contaminated. In this paper, we aim to (1) explore whether KGs can make LLMs more trustworthy in an open-ended setting, and (2) conduct a comparative analysis to shed light on method design. We believe this study can facilitate a more complete performance comparison and encourage continuous improvement in integrating KGs with LLMs to mitigate hallucination and make LLMs more trustworthy. Code and data are released at https://github.com/Y-Sui/OKGQA.

摘要

近期研究显示，整合知识图谱（KGs）能显著提升大语言模型（LLMs）的推理能力。然而现有基准测试主要针对封闭式任务，难以评估模型在复杂现实场景中的表现，也阻碍了对知识图谱降低大语言模型幻觉潜力的全面考察。为此，我们提出OKGQA这一新型基准测试，专门用于评估知识图谱增强的大语言模型在开放式现实问答场景中的表现。该基准通过多样化题型反映实际复杂性，并引入量化指标同时测量LLM+KG模型的幻觉率和推理改进。考虑到知识图谱可能存在不同程度错误，我们进一步提出OKGQA-P变体，用于评估图谱语义和结构受到故意干扰时的模型表现。本研究旨在：（1）探究开放场景中知识图谱能否提升大语言模型的可信度；（2）通过对比分析为方法设计提供启示。我们相信该研究能促进更全面的性能比较，推动知识图谱与大语言模型的持续融合以减轻幻觉问题，增强模型可信度。代码与数据已发布于https://github.com/Y-Sui/OKGQA。

GLEE: A Unified Framework and Benchmark for Language-based Economic Environments

Abstract

arXiv:2410.05254v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) show significant potential in economic and strategic interactions, where communication via natural language is often prevalent. This raises key questions: Do LLMs behave rationally? How do they perform compared to humans? Do they tend to reach an efficient and fair outcome? What is the role of natural language in strategic interaction? How do characteristics of the economic environment influence these dynamics? These questions become crucial concerning the economic and societal implications of integrating LLM-based agents into real-world data-driven systems, such as online retail platforms and recommender systems. To answer these questions, we introduce a benchmark for standardizing research on two-player, sequential, language-based games. Inspired by the economic literature, we define three base families of games with consistent parameterization, degrees of freedom and economic measures to evaluate agents' performance (self-gain), as well as the game outcome (efficiency and fairness). We develop an open-source framework for interaction simulation and analysis, and utilize it to collect a dataset of LLM vs. LLM interactions across numerous game configurations and an additional dataset of human vs. LLM interactions. Through extensive experimentation, we demonstrate how our framework and dataset can be used to: (i) compare the behavior of LLM-based agents in various economic contexts; (ii) evaluate agents in both individual and collective performance measures; and (iii) quantify the effect of the economic characteristics of the environments on the behavior of agents. Our results suggest that the market parameters, as well as the choice of the LLMs, tend to have complex and interdependent effects on the economic outcome, which calls for careful design and analysis of the language-based economic ecosystem.

摘要

大型语言模型（LLM）在涉及自然语言沟通的经济与战略互动中展现出显著潜力，这引发了一系列关键问题：LLM是否表现理性？其表现与人类相比如何？它们是否倾向于达成高效且公平的结果？自然语言在战略互动中扮演何种角色？经济环境特征如何影响这些动态机制？这些问题对于将基于LLM的智能体整合至现实世界数据驱动系统（如在线零售平台和推荐系统）所产生的经济与社会影响至关重要。为回答这些问题，我们提出了一个标准化双人序贯语言博弈研究的基准框架。受经济学文献启发，我们定义了三个基础游戏家族，采用统一参数化设计、自由度设置及经济指标来评估智能体表现（自利收益）和博弈结果（效率与公平性）。我们开发了用于交互模拟与分析的开源框架，并利用该框架收集了多种游戏配置下LLM间交互数据集，以及额外的人类与LLM交互数据集。通过大量实验，我们证明该框架和数据集可用于：(i)比较不同经济情境下基于LLM的智能体行为；(ii)从个体和集体绩效维度评估智能体表现；(iii)量化环境经济特征对智能体行为的影响。研究结果表明，市场参数与LLM选择往往对经济结果产生复杂且相互依赖的影响，这要求对基于语言的经济生态系统进行审慎设计与分析。

MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems

Abstract

arXiv:2412.07067v4 Announce Type: replace-cross Abstract: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce two sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.

摘要

稀疏混合专家（MoE）架构因其能高效扩展大语言模型（LLMs）而日益受到青睐，但其依赖于异构计算和内存资源。这些因素共同影响系统的成本、准确性和性能（CAP），使得权衡不可避免。现有基准测试往往无法准确捕捉这些权衡，从而增加了实际部署决策的复杂性。为解决这一问题，我们提出了专为MoE系统设计的基准测试MoE-CAP。我们的分析表明，在当前硬件条件下难以实现CAP三个维度的最优平衡；MoE系统通常只能优化其中两个维度，而牺牲第三个维度——我们将这一动态称为MoE-CAP权衡。为直观展示这一现象，我们提出了CAP雷达图。此外，我们引入了稀疏感知性能指标——稀疏内存带宽利用率（S-MBU）和稀疏模型浮点运算利用率（S-MFU），以实现跨不同硬件平台和部署场景的MoE系统性能精准基准测试。

Evaluating LLM-based Approaches to Legal Citation Prediction: Domain-specific Pre-training, Fine-tuning, or RAG? A Benchmark and an Australian Law Case Study

Abstract

arXiv:2412.06272v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have demonstrated strong potential across legal tasks, yet the problem of legal citation prediction remains under-explored. At its core, this task demands fine-grained contextual understanding and precise identification of relevant legislation or precedent. We introduce the AusLaw Citation Benchmark, a real-world dataset comprising 55k Australian legal instances and 18,677 unique citations, which, to the best of our knowledge, is the first of its scale and scope. We then conduct a systematic benchmarking across a range of solutions: (i) standard prompting of both general and law-specialised LLMs, (ii) retrieval-only pipelines with both generic and domain-specific embeddings, (iii) supervised fine-tuning, and (iv) several hybrid strategies that combine LLMs with retrieval augmentation through query expansion, voting ensembles, or re-ranking. Results show that neither general nor law-specific LLMs suffice as stand-alone solutions, with performance near zero. Instruction tuning (of even a generic open-source LLM) on a task-specific dataset is among the best-performing solutions. We highlight that database granularity, along with the type of embeddings, plays a critical role in retrieval-based approaches, with hybrid methods that utilise a trained re-ranker delivering the best results. Despite this, a performance gap of nearly 50% remains, underscoring the value of this challenging benchmark as a rigorous test-bed for future research in the legal domain.

摘要

大型语言模型（LLMs）在法律任务中展现出强大潜力，但法律引证预测问题仍未得到充分探索。该任务的核心在于对细粒度上下文的理解以及相关法规或判例的精准识别。我们提出AusLaw引证基准——一个包含5.5万条澳大利亚法律实例和18,677条独特引证的真实数据集，据我们所知，这是首个具有如此规模和范围的数据集。随后我们对多种解决方案进行了系统化基准测试：（i）通用及法律专用LLMs的标准提示，（ii）采用通用与领域专用嵌入的纯检索流程，（iii）监督微调，以及（iv）通过查询扩展、投票集成或重排序将LLMs与检索增强相结合的混合策略。结果表明，通用或法律专用LLMs作为独立解决方案均表现欠佳，准确率接近零。在特定任务数据集上进行指令微调（即使是通用开源LLM）成为最佳解决方案之一。我们强调数据库粒度与嵌入类型在基于检索的方法中至关重要，而采用训练重排序器的混合方法表现最优。尽管如此，仍有近50%的性能差距，这凸显了该挑战性基准作为法律领域未来研究严格测试平台的重要价值。

Breaking Information Cocoons: A Hyperbolic Graph-LLM Framework for Exploration and Exploitation in Recommender Systems

Abstract

arXiv:2411.13865v3 Announce Type: replace-cross Abstract: Modern recommender systems often create information cocoons, restricting users' exposure to diverse content. A key challenge lies in balancing content exploration and exploitation while allowing users to adjust their recommendation preferences. Intuitively, this balance can be modeled as a tree-structured representation, where depth search facilitates exploitation and breadth search enables exploration. However, existing approaches face two fundamental limitations: Euclidean methods struggle to capture hierarchical structures, while hyperbolic methods, despite their superior hierarchical modeling, lack semantic understanding of user and item profiles and fail to provide a principled mechanism for balancing exploration and exploitation. To address these challenges, we propose HERec, a hyperbolic graph-LLM framework that effectively balances exploration and exploitation in recommender systems. Our framework introduces two key innovations: (1) a semantic-enhanced hierarchical mechanism that aligns rich textual descriptions processed by large language models (LLMs) with collaborative information directly in hyperbolic space, allowing for more nuanced updates that respect the underlying hierarchical structure in user-item profiles; (2) an automatic hierarchical representation by optimizing Dasgupta's cost, which discovers hierarchical structures without requiring predefined hyperparameters, enabling user-adjustable exploration-exploitation trade-offs. Extensive experiments demonstrate that HERec consistently outperforms both Euclidean and hyperbolic baselines, achieving up to 5.49% improvement in utility metrics and 11.39% increase in diversity metrics, effectively mitigating information cocoons. We open-source our model implementation at https://github.com/Martin-qyma/HERec.

摘要

现代推荐系统常会形成信息茧房，限制用户接触多样化内容。核心挑战在于平衡内容探索与利用的同时，允许用户调整推荐偏好。直观上，这种平衡可建模为树状结构表示——深度搜索促进利用，广度搜索实现探索。然而现有方法存在两个根本局限：欧式方法难以捕捉层次结构，而双曲方法虽具优越的层次建模能力，却缺乏对用户和项目画像的语义理解，且无法提供探索-利用平衡的原则性机制。为解决这些问题，我们提出HERec框架，这是一种能有效平衡推荐系统中探索与利用的双曲图-大语言模型架构。该框架包含两项关键创新：(1) 语义增强的层次机制，将大语言模型处理的丰富文本描述与协同信息直接在双曲空间对齐，实现尊重用户-项目画像底层层次结构的精细化更新；(2) 通过优化Dasgupta成本实现自动层次表征，无需预定义超参数即可发现层次结构，支持用户可调节的探索-利用权衡。大量实验表明，HERec在效用指标上最高提升5.49%，多样性指标提升11.39%，始终优于欧式和双曲基线方法，有效缓解信息茧房。模型实现已开源：https://github.com/Martin-qyma/HERec。
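
For readers unfamiliar with Dasgupta's cost, mentioned in innovation (2) above, the following is a minimal sketch of the standard definition: for a tree T over items with pairwise similarities w(i, j), the cost is the sum over item pairs of w(i, j) times the number of leaves under the lowest common ancestor of i and j. How HERec optimizes this objective in hyperbolic space is not shown here; the nested-tuple tree encoding and the function name `dasgupta_cost` are assumptions made for illustration.

```python
def dasgupta_cost(tree, weights):
    """Dasgupta's cost of a hierarchy.

    tree:    nested tuples; any non-tuple value is a leaf label (assumed encoding).
    weights: dict mapping frozenset({i, j}) to a similarity; missing pairs count as 0.
    """

    def visit(node):
        # Returns (leaf labels under node, cost accumulated inside the subtree).
        if not isinstance(node, tuple):
            return [node], 0.0
        groups, cost = [], 0.0
        for child in node:
            leaves, sub_cost = visit(child)
            groups.append(leaves)
            cost += sub_cost
        size = sum(len(g) for g in groups)
        # Pairs drawn from different children have this node as their
        # lowest common ancestor, so they are charged `size` leaves.
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                for i in groups[a]:
                    for j in groups[b]:
                        cost += weights.get(frozenset((i, j)), 0.0) * size
        return [leaf for g in groups for leaf in g], cost

    return visit(tree)[1]


if __name__ == "__main__":
    w = {frozenset((0, 1)): 1.0, frozenset((2, 3)): 1.0, frozenset((1, 2)): 0.1}
    print(dasgupta_cost(((0, 1), (2, 3)), w))  # similar items merged low in the tree
    print(dasgupta_cost(((0, 2), (1, 3)), w))  # similar items split until the root
```

A lower cost means that strongly similar items are merged deep in the tree, which is why minimizing this quantity tends to recover a sensible hierarchy.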

Code Readability in the Age of Large Language Models: An Industrial Case Study from Atlassian

Abstract

arXiv:2501.11264v2 Announce Type: replace-cross Abstract: Software engineers spend a significant amount of time reading code during the software development process. This trend is amplified by the emergence of large language models (LLMs) that automatically generate code. However, little is known about the readability of the LLM-generated code and whether it is still important from practitioners' perspectives in this new era. In this paper, we conduct a survey to explore the practitioners' perspectives on code readability in the age of LLMs and investigate the readability of our LLM-based software development agents framework, HULA, by comparing its generated code with human-written code in real-world scenarios. Overall, the findings underscore that (1) readability remains a critical aspect of software development; (2) the readability of our LLM-generated code is comparable to human-written code, fostering the establishment of appropriate trust and driving the broad adoption of our LLM-powered software development platform.

摘要

在软件开发过程中，软件工程师需要花费大量时间阅读代码。随着自动生成代码的大型语言模型（LLMs）的出现，这一趋势进一步加剧。然而，目前对于LLM生成代码的可读性以及在这一新时代中从实践者角度来看其重要性仍知之甚少。本文通过一项调查，探讨了LLM时代实践者对代码可读性的看法，并通过在真实场景中将我们基于LLM的软件开发代理框架HULA生成的代码与人工编写的代码进行比较，研究了其可读性。总体而言，研究结果强调：（1）可读性仍然是软件开发的关键方面；（2）我们的LLM生成代码的可读性与人工编写的代码相当，这有助于建立适当的信任并推动我们基于LLM的软件开发平台的广泛采用。

BenCzechMark : A Czech-centric Multitask and Multimetric Benchmark for Large Language Models with Duel Scoring Mechanism

Abstract

arXiv:2412.17933v2 Announce Type: replace-cross Abstract: We present BenCzechMark (BCM), the first comprehensive Czech language benchmark designed for large language models, offering diverse tasks, multiple task formats, and multiple evaluation metrics. Its duel scoring system is grounded in statistical significance theory and uses aggregation across tasks inspired by social preference theory. Our benchmark encompasses 50 challenging tasks, with corresponding test datasets, primarily in native Czech, 14 of which are newly collected. These tasks span 8 categories and cover diverse domains, including historical Czech news, essays from pupils or language learners, and spoken word. Furthermore, we collect and clean BUT-Large Czech Collection, the largest publicly available clean Czech language corpus, and use it for (i) contamination analysis and (ii) continuous pretraining of the first Czech-centric 7B language model with Czech-specific tokenization. We use our model as a baseline for comparison with publicly available multilingual models. Lastly, we release and maintain a leaderboard with 50 existing model submissions, where new model submissions can be made at https://huggingface.co/spaces/CZLC/BenCzechMark.

摘要

我们推出BenCzechMark（BCM）——首个面向大型语言模型的综合性捷克语基准测试工具，提供多样化任务、多任务格式及多维度评估指标。该基准采用基于统计显著性理论的双重评分系统，并通过受社会偏好理论启发的任务聚合方法进行评估。基准包含50项具有挑战性的任务及对应测试数据集，其中14个为新采集数据集，主体为原生捷克语内容。这些任务涵盖8个类别及多个领域，包括历史捷克新闻、学生或语言学习者作文以及口语语料。此外，我们收集并清洗了BUT-Large捷克语语料库（目前最大的公开清洁捷克语语料库），用于：(i) 污染分析；(ii) 首个采用捷克语专属分词器的7B参数捷克中心语言模型的持续预训练。我们以该模型为基线，与现有公开多语言模型进行对比。最后，我们发布并维护包含50个已提交模型的排行榜，新模型可通过https://huggingface.co/spaces/CZLC/BenCzechMark提交。

LITA: An Efficient LLM-assisted Iterative Topic Augmentation Framework

Abstract

arXiv:2412.12459v2 Announce Type: replace-cross Abstract: Topic modeling is widely used for uncovering thematic structures within text corpora, yet traditional models often struggle with specificity and coherence in domain-focused applications. Guided approaches, such as SeededLDA and CorEx, incorporate user-provided seed words to improve relevance but remain labor-intensive and static. Large language models (LLMs) offer potential for dynamic topic refinement and discovery, yet their application often incurs high API costs. To address these challenges, we propose the LLM-assisted Iterative Topic Augmentation framework (LITA), an LLM-assisted approach that integrates user-provided seeds with embedding-based clustering and iterative refinement. LITA identifies a small number of ambiguous documents and employs an LLM to reassign them to existing or new topics, minimizing API costs while enhancing topic quality. Experiments on two datasets across topic quality and clustering performance metrics demonstrate that LITA outperforms five baseline models, including LDA, SeededLDA, CorEx, BERTopic, and PromptTopic. Our work offers an efficient and adaptable framework for advancing topic modeling and text clustering.

摘要

主题建模被广泛应用于揭示文本语料库中的主题结构，但传统模型在领域聚焦应用中常面临主题特异性和连贯性不足的问题。基于引导的方法（如SeededLDA和CorEx）通过引入用户提供的种子词来提升主题相关性，但仍存在人工成本高且静态化的局限。尽管大语言模型（LLMs）具备动态主题优化与发现的潜力，其应用往往伴随高昂的API成本。为解决这些问题，我们提出LLM辅助的迭代式主题增强框架（LITA），该方法将用户提供的种子词与基于嵌入的聚类及迭代优化相结合。LITA通过识别少量歧义文档，利用LLM将其重新分配至现有或新主题，在显著降低API成本的同时提升主题质量。基于主题质量和聚类性能指标的双数据集实验表明，LITA在LDA、SeededLDA、CorEx、BERTopic和PromptTopic五个基线模型中表现最优。本研究为推进主题建模与文本聚类提供了高效且可适配的框架。
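
The cost-saving step described above, sending only ambiguous documents to the LLM, might be sketched as follows. The abstract does not specify how ambiguity is measured, so the criterion used here (a small gap between the two highest cosine similarities to topic centroids) is an assumption, as are the function names and the `margin` parameter.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return dot / (nu * nv)


def split_by_ambiguity(doc_vecs, centroids, margin=0.1):
    """Assign clear-cut documents to a topic; collect the rest for LLM review.

    A document is treated as ambiguous (assumed criterion) when its best and
    second-best centroid similarities differ by less than `margin`.
    """
    confident, ambiguous = {}, []
    for idx, vec in enumerate(doc_vecs):
        sims = sorted(
            ((cosine(vec, c), k) for k, c in enumerate(centroids)), reverse=True
        )
        best_sim, best_topic = sims[0]
        runner_up = sims[1][0] if len(sims) > 1 else float("-inf")
        if best_sim - runner_up < margin:
            ambiguous.append(idx)  # only these would be sent to the LLM
        else:
            confident[idx] = best_topic
    return confident, ambiguous


if __name__ == "__main__":
    topics = [[1.0, 0.0], [0.0, 1.0]]
    docs = [[1.0, 0.0], [0.7, 0.7], [0.0, 2.0]]
    print(split_by_ambiguity(docs, topics))
```

Under this sketch, the number of LLM calls scales with the size of the `ambiguous` list rather than with the corpus, which is the source of the API savings the abstract refers to.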

Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data

Abstract

arXiv:2502.04380v2 Announce Type: replace-cross Abstract: Fine-tuning large language models (LLMs) using diverse datasets is crucial for enhancing their overall performance across various domains. In practical scenarios, existing methods based on modeling the mixture proportions of data composition often struggle with data whose domain labels are missing, imprecise or non-normalized, while methods based on data selection usually encounter difficulties in balancing multi-domain performance. To address these challenges, in this work, we investigate the role of data diversity in enhancing the overall abilities of LLMs by empirically constructing contrastive data pools and theoretically deriving explanations. Building upon the insights gained, we propose a new method that gives the LLM a dual identity: an output model to cognitively probe and select data based on diversity reward, as well as an input model to be tuned with the selected data. Extensive experiments show that the proposed method notably boosts performance across domain-undetermined data and a series of foundational downstream tasks when applied to various advanced LLMs. We release our code and hope this study can shed light on the understanding of data diversity and advance feedback-driven data-model co-design for LLMs.

摘要

通过多样化数据集对大型语言模型（LLM）进行微调，对于提升其跨领域综合性能至关重要。在实际场景中，现有基于数据构成混合比例建模的方法往往难以处理领域标签缺失、不精确或未规范化的数据，而基于数据选择的方法通常在多领域性能平衡方面存在困难。为解决这些挑战，本研究通过实证构建对比数据池并结合理论推导，深入探究了数据多样性对增强LLM综合能力的作用机制。基于所得洞见，我们提出了一种新方法，赋予LLM双重身份：作为输出模型通过多样性奖励进行认知探测与数据选择，同时作为输入模型利用所选数据进行微调。大量实验表明，该方法应用于各类先进LLM时，能显著提升领域未确定数据及一系列基础下游任务的性能。我们公开了代码，期望本研究能为理解数据多样性提供启示，并推动LLM反馈驱动的数据-模型协同设计发展。

Transferring Textual Preferences to Vision-Language Understanding through Model Merging

Abstract

arXiv:2502.13487v2 Announce Type: replace-cross Abstract: Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.

摘要

大型视觉语言模型（LVLMs）在各种多模态任务中表现卓越。然而，其评估生成内容的能力仍然有限，且基于偏好数据训练视觉语言奖励模型（VLRMs）的计算成本高昂。本文探索了一种无需训练的替代方案，通过将基于文本的奖励模型（RMs）与LVLMs融合来构建VLRMs。我们的研究表明，这种集成方法在评分性能上超越了LVLMs和基于文本的RMs，为将文本偏好高效融入LVLMs提供了一种有效途径。

InSTA: Towards Internet-Scale Training For Agents

Abstract

arXiv:2502.06776v2 Announce Type: replace-cross Abstract: The predominant approach for training web navigation agents is to gather human demonstrations for a set of popular websites and hand-written tasks, but it is becoming clear that human data is an inefficient resource. We develop a pipeline to facilitate internet-scale training for agents without laborious human annotations. In the first stage, an LLM annotates 150k sites with agentic tasks. In the next stage, LLM agents complete tasks and produce trajectories. In the final stage, an LLM filters trajectories by judging their success. Language models are powerful data curation tools, identifying harmful content with an accuracy of 97%, judging successful trajectories with an accuracy of 82.6%, and producing effective data. We train agents based on Qwen 3 1.7B that are competitive with frontier LLMs as web agents, while being smaller and faster. Our top agent reaches a success rate of 56.9%, outperforming the data collection policy Qwen 3 235B, a 235 times larger Llama 4 Maverick, and reaching 94.7% of the performance of Gemini 2.5 Flash. We are releasing code, models and data at: https://data-for-agents.github.io.

摘要

当前训练网络导航代理的主流方法是收集一组热门网站的人工演示数据和手写任务，但越来越明显的是，人类数据是一种低效资源。我们开发了一个无需繁琐人工标注即可实现互联网规模训练的代理训练流程。第一阶段，大型语言模型（LLM）为15万个网站标注代理任务；第二阶段，LLM代理执行任务并生成轨迹；第三阶段，LLM通过判断轨迹成功率进行筛选。语言模型是强大的数据筛选工具，识别有害内容准确率达97%，判断成功轨迹准确率达82.6%，并能生成有效数据。我们基于Qwen 3 1.7B训练的代理在网络导航任务中可与前沿LLM竞争，同时模型更小、速度更快。我们的最佳代理成功率达到56.9%，超越了数据收集策略Qwen 3 235B、体积大235倍的Llama 4 Maverick，并达到Gemini 2.5 Flash性能的94.7%。代码、模型和数据已发布于：https://data-for-agents.github.io。

C-3PO: Compact Plug-and-Play Proxy Optimization to Achieve Human-like Retrieval-Augmented Generation

Abstract

arXiv:2502.06205v2 Announce Type: replace-cross Abstract: Retrieval-augmented generation (RAG) systems face a fundamental challenge in aligning independently developed retrievers and large language models (LLMs). Existing approaches typically involve modifying either component or introducing simple intermediate modules, resulting in practical limitations and sub-optimal performance. Inspired by human search behavior, which typically involves a back-and-forth process of proposing search queries and reviewing documents, we propose C-3PO, a proxy-centric framework that facilitates communication between retrievers and LLMs through a lightweight multi-agent system. Our framework implements three specialized agents that collaboratively optimize the entire RAG pipeline without altering the retriever and LLMs. These agents work together to assess the need for retrieval, generate effective queries, and select information suitable for the LLMs. To enable effective multi-agent coordination, we develop a tree-structured rollout approach for reward credit assignment in reinforcement learning. Extensive experiments in both in-domain and out-of-distribution scenarios demonstrate that C-3PO significantly enhances RAG performance while maintaining plug-and-play flexibility and superior generalization capabilities.

摘要

检索增强生成（RAG）系统面临一个核心挑战：如何协调独立开发的检索器与大型语言模型（LLM）。现有方法通常通过修改其中某一组件或引入简单的中间模块来实现，这既存在实践局限性又导致性能欠佳。受人类搜索行为（通常包含提出搜索请求与审阅文档的迭代过程）启发，我们提出C-3PO——一个以代理为中心的框架，通过轻量级多智能体系统促进检索器与LLM间的协同交互。该框架部署了三个专用代理，在不修改检索器和LLM的前提下协同优化整个RAG流程：这些代理共同决策检索需求、生成高效查询语句，并筛选适配LLM的信息。为实现有效的多智能体协作，我们开发了基于树状结构展开的强化学习奖励分配方法。在领域内和分布外场景的大量实验表明，C-3PO在保持即插即用灵活性和卓越泛化能力的同时，显著提升了RAG系统的性能。

Agentic AI Software Engineers: Programming with Trust

Abstract

arXiv:2502.13767v3 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have shown surprising proficiency in generating code snippets, promising to automate large parts of software engineering via artificial intelligence (AI). We argue that successfully deploying AI software engineers requires a level of trust equal to or even greater than the trust established by human-driven software engineering practices. The recent trend toward LLM agents offers a path toward integrating the power of LLMs to create new code with the power of analysis tools to increase trust in the code. This opinion piece comments on whether LLM agents could dominate software engineering workflows in the future and whether the focus of programming will shift from programming at scale to programming with trust.

摘要

大型语言模型（LLMs）在生成代码片段方面展现出惊人的能力，有望通过人工智能（AI）实现软件工程的大规模自动化。我们认为，成功部署AI软件工程师需要达到甚至超越人类驱动软件工程实践所建立的信任水平。近期LLM智能体的发展趋势为整合LLMs生成新代码的能力与分析工具增强代码可信度的能力提供了路径。本文探讨了LLM智能体是否会在未来主导软件工程工作流，以及编程重点是否会从规模化编程转向可信编程。

SelfCite: Self-Supervised Alignment for Context Attribution in Large Language Models

Abstract

arXiv:2502.09604v2 Announce Type: replace-cross Abstract: We introduce SelfCite, a novel self-supervised approach that aligns LLMs to generate high-quality, fine-grained, sentence-level citations for the statements in their generated responses. Instead of only relying on costly and labor-intensive annotations, SelfCite leverages a reward signal provided by the LLM itself through context ablation: If a citation is necessary, removing the cited text from the context should prevent the same response; if sufficient, retaining the cited text alone should preserve the same response. This reward can guide the inference-time best-of-N sampling strategy to improve citation quality significantly, as well as be used in preference optimization to directly fine-tune the models for generating better citations. The effectiveness of SelfCite is demonstrated by increasing citation F1 up to 5.3 points on the LongBench-Cite benchmark across five long-form question answering tasks. The source code is available at https://github.com/facebookresearch/SelfCite

摘要

我们提出了SelfCite，一种新颖的自监督方法，用于对齐大型语言模型（LLM）以生成高质量、细粒度的句子级引用，为其生成回答中的陈述提供依据。该方法并非仅依赖昂贵且劳动密集的人工标注，而是通过上下文消融利用LLM自身提供的奖励信号：若某引用确有必要，则从上下文中移除被引文本应导致原回答无法生成；若引用充分，则仅保留被引文本应能维持原回答不变。该奖励信号可指导推理阶段的N选一最佳采样策略，显著提升引用质量，同时可用于偏好优化以直接微调模型，从而生成更优质的引用。实验表明，SelfCite在LongBench-Cite基准测试的五个长格式问答任务中，将引用F1值最高提升5.3个百分点。源代码已发布于https://github.com/facebookresearch/SelfCite。
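
The context-ablation reward described in the SelfCite abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `log_prob` scorer interface, the additive combination of the necessity and sufficiency terms, and the toy word-overlap scorer are all assumptions made for the example.

```python
from typing import AbstractSet, Callable, Sequence

# Hypothetical scorer type: returns log p(response | context sentences).
LogProb = Callable[[Sequence[str], str], float]


def ablation_reward(log_prob: LogProb, context: Sequence[str],
                    cited: AbstractSet[int], response: str) -> float:
    """Score a set of cited sentence indices by context ablation."""
    kept = [s for i, s in enumerate(context) if i in cited]
    ablated = [s for i, s in enumerate(context) if i not in cited]
    full = log_prob(context, response)
    # Necessity: removing the cited text should make the response less likely.
    necessity = full - log_prob(ablated, response)
    # Sufficiency: the cited text alone should keep the response about as likely.
    sufficiency = log_prob(kept, response) - full
    return necessity + sufficiency  # assumed combination rule


def best_of_n(log_prob: LogProb, context: Sequence[str],
              candidates: Sequence[AbstractSet[int]], response: str) -> AbstractSet[int]:
    """Pick the candidate citation set with the highest ablation reward."""
    return max(candidates, key=lambda c: ablation_reward(log_prob, context, c, response))


def toy_log_prob(context: Sequence[str], response: str) -> float:
    """Toy stand-in for an LLM: minus the count of response words absent from the context."""
    words = " ".join(context).split()
    return -float(sum(w not in words for w in response.split()))
```

In practice `log_prob` would wrap a real model's scoring of the response given the (ablated) context; the toy scorer only exists so the sketch runs without a model.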

No Need for Explanations: LLMs can implicitly learn from mistakes in-context

Abstract

arXiv:2502.08550v2 Announce Type: replace-cross Abstract: Showing incorrect answers to Large Language Models (LLMs) is a popular strategy to improve their performance in reasoning-intensive tasks. It is widely assumed that, in order to be helpful, the incorrect answers must be accompanied by comprehensive rationales, explicitly detailing where the mistakes are and how to correct them. However, in this work we present a counterintuitive finding: we observe that LLMs perform better in math reasoning tasks when these rationales are eliminated from the context and models are left to infer on their own what makes an incorrect answer flawed. This approach also substantially outperforms chain-of-thought prompting in our evaluations. These results are consistent across LLMs of different sizes and varying reasoning abilities. To gain an understanding of why LLMs learn from mistakes more effectively without explicit corrective rationales, we perform a thorough analysis, investigating changes in context length and answer diversity between different prompting strategies, and their effect on performance. We also examine evidence of overfitting to the in-context rationales when these are provided, and study the extent to which LLMs are able to autonomously infer high-quality corrective rationales given only incorrect answers as input. We find evidence that, while incorrect answers are more beneficial for LLM learning than additional diverse correct answers, explicit corrective rationales over-constrain the model, thus limiting those benefits.

摘要

向大型语言模型（LLM）展示错误答案是一种提升其推理密集型任务表现的常用策略。学界普遍认为，要使这种策略有效，错误答案必须附带全面的解释，明确说明错误所在及修正方法。然而，本研究发现了一个反直觉的现象：在数学推理任务中，当这些解释从上下文中移除、仅由模型自行推断错误答案的缺陷时，LLM的表现反而更优。该方法在我们的评估中也显著优于思维链提示策略。这一结果在不同规模和推理能力的LLM中均保持一致。为探究LLM为何在缺乏显性纠错解释时能从错误中更有效学习，我们进行了系统分析：研究不同提示策略下上下文长度与答案多样性的变化及其对性能的影响；检验提供解释时模型对上下文解释的过拟合现象；并评估LLM仅基于错误答案自主推断高质量纠错解释的能力。研究发现：虽然错误答案比增加多样化的正确答案更有利于LLM学习，但显性纠错解释会过度约束模型，从而削弱这种益处。

Prot2Chat: Protein LLM with Early-Fusion of Text, Sequence and Structure

Abstract

arXiv:2502.06846v2 Announce Type: replace-cross Abstract: Motivation: Proteins are of great significance in living organisms. However, understanding their functions encounters numerous challenges, such as insufficient integration of multimodal information, a large number of training parameters, limited flexibility of classification-based methods, and the lack of systematic evaluation metrics for protein Q&A systems. To tackle these issues, we propose the Prot2Chat framework. Results: We modified ProteinMPNN to encode protein sequence and structural information in a unified way. We used a large language model (LLM) to encode questions into vectors and developed a protein-text adapter to compress protein information into virtual tokens based on these vectors, achieving the early fusion of text and protein information. Finally, the same LLM reads the virtual tokens and the questions to generate answers. To optimize training efficiency, we froze the encoder and employed Low-Rank Adaptation (LoRA) techniques for the LLM. Experiments on two datasets show that both automated metrics and expert evaluations demonstrate the superior performance of our model, and zero-shot prediction results highlight its generalization ability. The models and code are available at https://github.com/wangzc1233/Prot2Chat. Contact: zqcao@suda.edu.cn or wangzc025@163.com. Keywords: Protein Q&A, Early-Fusion, LLM

摘要

动机：蛋白质在生物体中具有极其重要的作用。然而，理解其功能面临着诸多挑战，包括多模态信息整合不足、训练参数量庞大、基于分类方法灵活性有限，以及蛋白质问答系统缺乏系统性评估指标等问题。为解决这些难题，我们提出了Prot2Chat框架。结果：我们改进ProteinMPNN实现了蛋白质序列与结构信息的统一编码，利用大语言模型（LLM）将问题编码为向量，并开发了蛋白质文本适配器以基于这些向量将蛋白质信息压缩为虚拟标记，实现了文本与蛋白质信息的早期融合。最终由同一LLM读取虚拟标记和问题生成答案。为优化训练效率，我们冻结了编码器并对LLM采用低秩自适应（LoRA）技术。在两个数据集上的实验表明，自动化指标与专家评估均证实本模型的优越性能，零样本预测结果则凸显了其泛化能力。模型与代码详见https://github.com/wangzc1233/Prot2Chat。联系人：zqcao@suda.edu.cn或wangzc025@163.com 关键词：蛋白质问答、早期融合、大语言模型

GRIFFIN: Effective Token Alignment for Faster Speculative Decoding

Abstract

arXiv:2502.11018v2 Announce Type: replace-cross Abstract: Speculative decoding accelerates inference in large language models (LLMs) by generating multiple draft tokens simultaneously. However, existing methods often struggle with token misalignment between the training and decoding phases, limiting their performance. To address this, we propose GRIFFIN, a novel framework that incorporates a token-alignable training strategy and a token-alignable draft model to mitigate misalignment. The training strategy employs a loss masking mechanism to exclude highly misaligned tokens during training, preventing them from negatively impacting the draft model's optimization. The token-alignable draft model introduces input tokens to correct inconsistencies in generated features. Experiments on LLaMA, Vicuna, Qwen and Mixtral models demonstrate that GRIFFIN achieves an average acceptance length improvement of over 8% and a speedup ratio exceeding 7%, outperforming current state-of-the-art speculative decoding methods. Our code and GRIFFIN's draft models are publicly released at https://github.com/hsj576/GRIFFIN.

摘要

推测解码通过同时生成多个候选令牌来加速大语言模型（LLM）的推理过程。然而，现有方法常面临训练阶段与解码阶段令牌不对齐的问题，限制了性能提升。为此，我们提出GRIFFIN框架，该框架采用令牌可对齐训练策略和令牌可对齐草稿模型以缓解不对齐现象。训练策略通过损失掩蔽机制排除高度不对齐的令牌，防止其对草稿模型优化产生负面影响；令牌可对齐草稿模型则引入输入令牌以修正生成特征的不一致性。在LLaMA、Vicuna、Qwen和Mixtral模型上的实验表明，GRIFFIN平均接受长度提升超过8%，加速比超过7%，优于当前最先进的推测解码方法。代码及GRIFFIN草稿模型已开源：https://github.com/hsj576/GRIFFIN。

The Hidden Dimensions of LLM Alignment: A Multi-Dimensional Analysis of Orthogonal Safety Directions

Abstract

arXiv:2502.09674v3 Announce Type: replace-cross Abstract: Large Language Models' safety-aligned behaviors, such as refusing harmful queries, can be represented by linear directions in activation space. Previous research modeled safety behavior with a single direction, limiting mechanistic understanding to an isolated safety feature. In this work, we discover that safety-aligned behavior is jointly controlled by multi-dimensional directions. Namely, we study the vector space of representation shifts during safety fine-tuning on Llama 3 8B for refusing jailbreaks. By studying orthogonal directions in the space, we first find that a dominant direction governs the model's refusal behavior, while multiple smaller directions represent distinct and interpretable features like hypothetical narrative and role-playing. We then measure how different directions promote or suppress the dominant direction, showing the important role of secondary directions in shaping the model's refusal representation. Finally, we demonstrate that removing certain trigger tokens in harmful queries can mitigate these directions to bypass the learned safety capability, providing new insights on understanding safety alignment vulnerability from a multi-dimensional perspective. Code and artifacts are available at https://github.com/BMPixel/safety-residual-space.

摘要

大语言模型的安全对齐行为（如拒绝有害查询）可通过激活空间中的线性方向表征。先前研究采用单一方向建模安全行为，将机制理解局限于孤立的安全特征。本研究发现，安全对齐行为实际由多维方向共同调控。具体而言，我们研究了Llama 3 8B模型在安全微调过程中针对越狱拒绝的表征偏移向量空间。通过分析空间中的正交方向，首先发现主导方向控制模型的拒绝行为，而多个次要方向表征了可解释的独立特征（如假设性叙述和角色扮演）。随后测量不同方向对主导方向的促进或抑制效应，揭示次要方向在塑造模型拒绝表征中的重要作用。最后证明，移除有害查询中的特定触发词可削弱这些方向以绕过已学习的安全能力，为从多维视角理解安全对齐脆弱性提供了新见解。代码与实验材料详见https://github.com/BMPixel/safety-residual-space。

Slamming: Training a Speech Language Model on One GPU in a Day

Abstract

arXiv:2502.15814v2 Announce Type: replace-cross Abstract: We introduce Slam, a recipe for training high-quality Speech Language Models (SLMs) on a single academic GPU in 24 hours. We do so through empirical analysis of model initialisation and architecture, synthetic training data, preference optimisation with synthetic data, and tweaks to all other components. We empirically demonstrate that this training recipe also scales well with more compute, achieving results on par with leading SLMs at a fraction of the compute cost. We hope these insights will make SLM training and research more accessible. In the context of SLM scaling laws, our results far outperform predicted compute-optimal performance, giving an optimistic view of SLM feasibility. Code, data, models, and samples are available at https://pages.cs.huji.ac.il/adiyoss-lab/slamming.

摘要

我们推出Slam方案，这是一种在单块学术级GPU上24小时内训练高质量语音语言模型（SLM）的方法。通过对模型初始化与架构、合成训练数据、基于合成数据的偏好优化以及所有其他组件的调整进行实证分析，我们实现了这一目标。实验表明该训练方案具有良好的计算扩展性，仅需部分计算成本即可获得与主流SLM相当的结果。希望这些发现能降低SLM训练与研究门槛。在SLM扩展定律背景下，我们的结果远超计算最优性能预测，为SLM可行性提供了乐观前景。代码、数据、模型及样本详见：https://pages.cs.huji.ac.il/adiyoss-lab/slamming。

CoT-ICL Lab: A Synthetic Framework for Studying Chain-of-Thought Learning from In-Context Demonstrations

Abstract

arXiv:2502.15132v3 Announce Type: replace-cross Abstract: We introduce CoT-ICL Lab, a framework and methodology to generate synthetic tokenized datasets and systematically study chain-of-thought (CoT) in-context learning (ICL) in language models. CoT-ICL Lab allows fine grained control over the complexity of in-context examples by decoupling (1) the causal structure involved in chain token generation from (2) the underlying token processing functions. We train decoder-only transformers (up to 700M parameters) on these datasets and show that CoT accelerates the accuracy transition to higher values across model sizes. In particular, we find that model depth is crucial for leveraging CoT with limited in-context examples, while more examples help shallow models match deeper model performance. Additionally, limiting the diversity of token processing functions throughout training improves causal structure learning via ICL. We also interpret these transitions by analyzing transformer embeddings and attention maps. Overall, CoT-ICL Lab serves as a simple yet powerful testbed for theoretical and empirical insights into ICL and CoT in language models.

摘要

我们提出CoT-ICL Lab框架与方法论，用于生成合成标记化数据集并系统研究语言模型中思维链（CoT）的上下文学习（ICL）。该框架通过解耦（1）链式标记生成涉及的因果结构与（2）底层标记处理函数，实现对上下文示例复杂度的细粒度控制。我们在这些数据集上训练仅解码器架构的Transformer模型（参数量达7亿），发现CoT能加速不同规模模型向更高准确率的过渡。特别地，研究发现模型深度对有限上下文示例下的CoT应用至关重要，而增加示例量可使浅层模型达到深层模型的性能水平。此外，训练过程中限制标记处理函数的多样性可通过ICL提升因果结构学习效果。我们通过分析Transformer的嵌入表示和注意力图对这些过渡现象进行解释。总体而言，CoT-ICL Lab为语言模型中ICL与CoT的理论和实证研究提供了简洁而强大的实验平台。

Collab-Overcooked: Benchmarking and Evaluating Large Language Models as Collaborative Agents

Abstract

arXiv:2502.20073v2 Announce Type: replace-cross Abstract: Large language models (LLMs) based agent systems have made great strides in real-world applications beyond traditional NLP tasks. This paper proposes a new LLM-powered Multi-Agent System (LLM-MAS) benchmark, Collab-Overcooked, built on the popular Overcooked-AI game with more applicable and challenging tasks in interactive environments. Collab-Overcooked extends existing benchmarks from two novel perspectives. First, it provides a multi-agent framework supporting diverse tasks and objectives and encourages collaboration through natural language communication. Second, it introduces a spectrum of process-oriented evaluation metrics to assess the fine-grained collaboration capabilities of different LLM agents, a dimension often overlooked in prior work. We conduct extensive experiments over 11 popular LLMs and show that, while the LLMs present a strong ability in goal interpretation, there is a significant discrepancy in active collaboration and continuous adaptation which are critical for efficiently fulfilling complicated tasks. Notably, we highlight the strengths and weaknesses in LLM-MAS and provide insights for improving and evaluating LLM-MAS on a unified and open-sourced benchmark. The environments, 30 open-ended tasks, and the evaluation package are publicly available at https://github.com/YusaeMeow/Collab-Overcooked.

摘要

基于大语言模型（LLM）的智能体系统已在传统自然语言处理任务之外的现实应用中取得重大进展。本文提出了一种新型LLM驱动的多智能体系统（LLM-MAS）基准测试Collab-Overcooked，该基准建立在流行的Overcooked-AI游戏基础上，为交互环境设计了更具适用性和挑战性的任务。Collab-Overcooked从两个新颖维度拓展了现有基准：首先，它提供支持多样化任务目标的多智能体框架，通过自然语言通信促进协作；其次，引入了一套面向过程的细粒度评估指标，用于衡量不同LLM智能体的协作能力——这一维度在先前研究中常被忽视。我们对11种主流LLM进行了大量实验，结果表明虽然LLM展现出强大的目标解析能力，但在主动协作和持续适应等对高效完成复杂任务至关重要的维度上仍存在显著差距。值得注意的是，本研究系统揭示了LLM-MAS的优势与不足，并为在统一开源基准上改进和评估LLM-MAS提供了见解。实验环境、30项开放式任务及评估工具包已开源发布于https://github.com/YusaeMeow/Collab-Overcooked。

Steer LLM Latents for Hallucination Detection

Abstract

arXiv:2503.01917v2 Announce Type: replace-cross Abstract: Hallucinations in LLMs pose a significant concern to their safe deployment in real-world applications. Recent approaches have leveraged the latent space of LLMs for hallucination detection, but their embeddings, optimized for linguistic coherence rather than factual accuracy, often fail to clearly separate truthful and hallucinated content. To this end, we propose the Truthfulness Separator Vector (TSV), a lightweight and flexible steering vector that reshapes the LLM's representation space during inference to enhance the separation between truthful and hallucinated outputs, without altering model parameters. Our two-stage framework first trains TSV on a small set of labeled exemplars to form compact and well-separated clusters. It then augments the exemplar set with unlabeled LLM generations, employing an optimal transport-based algorithm for pseudo-labeling combined with a confidence-based filtering process. Extensive experiments demonstrate that TSV achieves state-of-the-art performance with minimal labeled data, exhibiting strong generalization across datasets and providing a practical solution for real-world LLM applications.

摘要

大型语言模型（LLM）中的幻觉现象对其在现实应用中的安全部署构成重大隐患。现有方法虽利用LLM的潜在空间进行幻觉检测，但由于其嵌入空间以语言连贯性而非事实准确性为优化目标，往往难以清晰区分真实内容与幻觉内容。为此，我们提出"真实性分离向量"（TSV）——一种轻量级、灵活的导向向量，可在推理过程中重塑LLM的表征空间以增强真实输出与幻觉输出的分离度，且无需修改模型参数。我们的两阶段框架首先在少量标注样本上训练TSV以形成紧凑且分离良好的聚类，随后通过基于最优传输的伪标注算法结合置信度过滤机制，将未标注的LLM生成内容扩充至样本集。大量实验表明，TSV在极少量标注数据下即可实现最先进的性能，展现出跨数据集的强泛化能力，为实际LLM应用提供了实用解决方案。

HybridNorm: Towards Stable and Efficient Transformer Training via Hybrid Normalization

Abstract

arXiv:2503.04598v3 Announce Type: replace-cross Abstract: Transformers have become the de facto architecture for a wide range of machine learning tasks, particularly in large language models (LLMs). Despite their remarkable performance, challenges remain in training deep transformer networks, especially regarding the position of layer normalization. While Pre-Norm structures facilitate more stable training owing to their stronger identity path, they often lead to suboptimal performance compared to Post-Norm. In this paper, we propose HybridNorm, a simple yet effective hybrid normalization strategy that integrates the advantages of both Pre-Norm and Post-Norm. Specifically, HybridNorm employs QKV normalization within the attention mechanism and Post-Norm in the feed-forward network (FFN) of each transformer block. We provide both theoretical insights and empirical evidence demonstrating that HybridNorm improves gradient flow and model robustness. Extensive experiments on large-scale transformer models, including both dense and sparse variants, show that HybridNorm consistently outperforms both Pre-Norm and Post-Norm approaches across multiple benchmarks. These findings highlight the potential of HybridNorm as a more stable and effective technique for improving the training and performance of deep transformer models. Code is available at https://github.com/BryceZhuo/HybridNorm.

摘要

Transformer已成为各类机器学习任务（尤其是大语言模型领域）事实上的标准架构。尽管其性能卓越，但深度Transformer网络的训练仍存在挑战，特别是层归一化的位置问题。虽然Pre-Norm结构因其更强的恒等路径能实现更稳定的训练，但其性能往往逊于Post-Norm。本文提出HybridNorm，这是一种简单有效的混合归一化策略，可整合Pre-Norm与Post-Norm的优势。具体而言，HybridNorm在注意力机制中采用QKV归一化，而在每个Transformer块的前馈网络（FFN）中使用Post-Norm。我们通过理论分析和实证研究表明，HybridNorm能改善梯度流动并增强模型鲁棒性。在大规模Transformer模型（包括稠密和稀疏变体）上的大量实验表明，HybridNorm在多个基准测试中始终优于Pre-Norm和Post-Norm方法。这些发现凸显了HybridNorm作为一种更稳定有效的技术，在改进深度Transformer模型训练与性能方面的潜力。代码已发布于https://github.com/BryceZhuo/HybridNorm。
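
One way to read the block structure stated in the HybridNorm abstract is written out below for a single attention head. This is a sketch under assumptions, not the paper's exact formulation: the symbols $X$, $W_Q$, $W_K$, $W_V$, $W_O$, $d_k$, the single-head simplification, and the placement of the residual connections are all chosen here for illustration, and $\operatorname{Norm}$ stands for an unspecified normalization layer.

```latex
% Assumed notation: X is the block input, Y the attention sublayer output,
% Z the block output; Norm is a normalization layer such as LayerNorm or RMSNorm.
\begin{align}
Q &= X W_Q, \qquad K = X W_K, \qquad V = X W_V \\
% QKV normalization inside attention, with a plain residual connection
Y &= X + \operatorname{softmax}\!\left(
      \frac{\operatorname{Norm}(Q)\,\operatorname{Norm}(K)^{\top}}{\sqrt{d_k}}
    \right) \operatorname{Norm}(V)\, W_O \\
% Post-Norm in the feed-forward sublayer: normalize after the residual sum
Z &= \operatorname{Norm}\bigl(Y + \operatorname{FFN}(Y)\bigr)
\end{align}
```

For comparison, a Pre-Norm FFN sublayer would instead compute $Y + \operatorname{FFN}(\operatorname{Norm}(Y))$, which leaves the identity path unnormalized.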

Capacity-Aware Inference: Mitigating the Straggler Effect in Mixture of Experts

Abstract

arXiv:2503.05066v2 Announce Type: replace-cross Abstract: The Mixture of Experts (MoE) is an effective architecture for scaling large language models by leveraging sparse expert activation, optimizing the trade-off between performance and efficiency. However, under expert parallelism, MoE suffers from inference inefficiencies due to imbalanced token-to-expert assignment, where some experts are overloaded while others remain underutilized. This imbalance leads to poor resource utilization and increased latency, as the most burdened expert dictates the overall delay, a phenomenon we define as the Straggler Effect. To mitigate this, we propose Capacity-Aware Inference, including two key techniques: (1) Capacity-Aware Token Drop, which discards overloaded tokens to regulate the maximum latency of MoE, and (2) Capacity-Aware Token Reroute, which reallocates overflowed tokens to underutilized experts, balancing the token distribution. These techniques collectively optimize both high-load and low-load expert utilization, leading to a more efficient MoE inference pipeline. Extensive experiments demonstrate the effectiveness of our methods, showing significant improvements in inference efficiency, e.g., 0.2% average performance increase and a 1.94× inference speedup on Mixtral-8×7B-Instruct.

摘要

混合专家模型（MoE）通过稀疏专家激活机制有效扩展了大语言模型的规模，优化了性能与效率的权衡。然而在专家并行架构下，MoE因令牌与专家分配不均衡而存在推理效率低下的问题：部分专家过载而其他专家利用率不足。这种失衡导致资源利用率下降和延迟增加，最繁忙的专家决定了整体延迟，我们将此现象定义为"拖尾效应"。为缓解该问题，我们提出容量感知推理技术，包含两项关键技术：(1) "容量感知令牌丢弃"机制，通过丢弃过载令牌来调控MoE的最大延迟；(2) "容量感知令牌重路由"机制，将溢出令牌重新分配给低利用率专家以平衡令牌分布。这些技术协同优化了高负载与低负载专家的利用率，构建了更高效的MoE推理流程。大量实验验证了方法的有效性，在Mixtral-8×7B-Instruct模型上实现了0.2%的平均性能提升和1.94×的推理加速。
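
The two techniques named in the Capacity-Aware Inference abstract can be illustrated with a small routing sketch. It is a simplified illustration, not the paper's algorithm: the function name `assign_with_capacity`, the first-come processing order, and the use of each token's ranked expert list to choose a reroute target are assumptions made for the example.

```python
from typing import List, Optional, Sequence


def assign_with_capacity(preferences: Sequence[Sequence[int]], num_experts: int,
                         capacity: int, reroute: bool) -> List[Optional[int]]:
    """Assign each token to an expert without exceeding a per-expert capacity.

    preferences[t] lists expert ids for token t, most preferred first.
    With reroute=False an overflowing token is dropped (None); with
    reroute=True it falls back to its next preferred expert with spare room.
    """
    load = [0] * num_experts
    assignment: List[Optional[int]] = []
    for ranked in preferences:
        # Token drop only ever considers the top choice; reroute tries fallbacks too.
        options = ranked if reroute else ranked[:1]
        chosen: Optional[int] = None
        for expert in options:
            if load[expert] < capacity:
                chosen = expert
                load[expert] += 1
                break
        assignment.append(chosen)
    return assignment
```

Because no expert ever processes more than `capacity` tokens, the slowest expert's work is bounded, which is the mechanism by which the abstract's straggler latency is capped; rerouting additionally keeps overflow tokens instead of discarding them.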

From 1,000,000 Users to Every User: Scaling Up Personalized Preference for User-level Alignment

Abstract

arXiv:2503.15463v3 Announce Type: replace-cross Abstract: Large language models (LLMs) have traditionally been aligned through one-size-fits-all approaches that assume uniform human preferences, fundamentally overlooking the diversity in user values and needs. This paper introduces a comprehensive framework for scalable personalized alignment of LLMs. We establish a systematic preference space characterizing psychological and behavioral dimensions, alongside diverse persona representations for robust preference inference in real-world scenarios. Building upon this foundation, we introduce AlignX, a large-scale dataset of over 1.3 million personalized preference examples, and develop two complementary alignment approaches: in-context alignment, which directly conditions on persona representations, and preference-bridged alignment, which models intermediate preference distributions. Extensive experiments demonstrate substantial improvements over existing methods, with an average 17.06% accuracy gain across four benchmarks while exhibiting a strong adaptation capability to novel preferences, robustness to limited user data, and precise preference controllability. These results validate our approach toward user-adaptive AI systems.

摘要

传统的大型语言模型（LLM）对齐方法采用"一刀切"策略，假设人类偏好具有统一性，本质上忽视了用户价值观与需求的多样性。本文提出一个可扩展的个性化LLM对齐综合框架：首先构建系统化的偏好空间以刻画心理和行为维度，并设计多样化的人物表征以实现现实场景中的稳健偏好推断。基于此，我们提出包含超过130万条个性化偏好实例的大规模数据集AlignX，并开发两种互补的对齐方法：直接基于人物表征的上下文对齐，以及建模中间偏好分布的偏好桥接对齐。大量实验表明，该方法在四个基准测试中平均准确率提升17.06%，同时展现出对新型偏好的强适应能力、有限用户数据下的鲁棒性以及精确的偏好可控性。这些结果验证了我们构建用户自适应AI系统的有效性。

Dion: Distributed Orthonormalized Updates

Abstract

arXiv:2504.05295v2 Announce Type: replace-cross Abstract: Recent work has shown that orthonormal matrix updates speed up neural network optimization, improve training stability, and offer better hyperparameter transfer across model sizes. Applying these updates efficiently when model weights and optimizer states are sharded across a large-scale distributed LLM training system remains a major challenge. We introduce Dion (DIstributed OrthoNormalization), a scalable and communication-efficient orthonormalizing optimizer. Dion leverages low-rank approximation and decoupled momentum buffers, eliminating the need for full gradient synchronization while producing numerically equivalent results. It is compatible with simultaneous DDP, FSDP, and TP parallelism, and it computes an orthonormalized update without unsharding a full parameter matrix on any single device. We evaluate Dion on language models from 120M to 3B parameters and find that its benefits improve with increasing model size and batch size.

摘要

近期研究表明，正交矩阵更新能加速神经网络优化、提升训练稳定性，并在不同规模模型间实现更好的超参数迁移。然而，当模型权重和优化器状态分散于大规模分布式LLM训练系统时，如何高效应用这些更新仍存在重大挑战。本文提出Dion（分布式正交归一化优化器），这是一种可扩展且通信高效的正交化优化器。Dion采用低秩近似与解耦动量缓冲技术，在无需全梯度同步的情况下即可产生数值等效结果。该方法可同时兼容DDP、FSDP和TP并行架构，且无需在任何单一设备上反分片完整参数矩阵即可计算正交化更新。我们在1.2亿至30亿参数的语言模型上评估Dion，发现其优势随模型规模和批量大小的增加而显著提升。

Hallucination Detection in LLMs with Topological Divergence on Attention Graphs

Abstract

arXiv:2504.10063v2 Announce Type: replace-cross Abstract: Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models (LLMs). We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting, which leverages a topological divergence metric to quantify the structural properties of graphs induced by attention matrices. Examining the topological divergence between prompt and response subgraphs reveals consistent patterns: higher divergence values in specific attention heads correlate with hallucinated outputs, independent of the dataset. Extensive experiments - including evaluation on question answering and summarization tasks - show that our approach achieves state-of-the-art or competitive results on several benchmarks while requiring minimal annotated data and computational resources. Our findings suggest that analyzing the topological structure of attention matrices can serve as an efficient and robust indicator of factual reliability in LLMs.

摘要

幻觉（即生成事实错误内容）仍是大型语言模型（LLM）面临的关键挑战。本文提出TOHA——一种基于拓扑结构的检索增强生成环境下的幻觉检测器，该方法通过拓扑散度度量来量化注意力矩阵所诱导图的结构特性。研究发现提示子图与响应子图间的拓扑散度呈现稳定规律：特定注意力头中较高的散度值与幻觉输出显著相关，且该现象具有数据集无关性。在问答和摘要任务上的大量实验表明，我们的方法在多个基准测试中达到或接近最优性能，同时仅需少量标注数据和计算资源。这些发现证明，分析注意力矩阵的拓扑结构可作为评估LLM事实可靠性的高效鲁棒指标。

MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?

Abstract

arXiv:2503.09499v2 Announce Type: replace-cross Abstract: Large foundation models face challenges in acquiring transferable, structured thinking abilities, especially when supervised with rigid templates or crowd-annotated instruction datasets. Unlike prior approaches, we focus on a thinking-centric data synthesis paradigm that enables models to evolve through self-generated, cognitively guided data. We propose MindGYM, a structured and scalable framework for question synthesis, composed of: (1) Cognitive Thinking Process Injection, which infuses high-level reasoning objectives to shape the model's synthesis behavior; (2) Seed Single-Hop Question Synthesis, generating atomic questions from diverse semantic types to encourage broader thinking; and (3) Challenging Multi-Hop QA Synthesis, composing more complex multi-hop questions based on QA seeds for deeper reasoning. Detailed analysis shows that synthetic data generated by our method achieves 16.7% higher average quality and 67.91% lower quality variance compared to baseline sources, highlighting that both high-quality and self-contained data are essential for effective, thinking-oriented fine-tuning. MindGYM improves performance on six reasoning benchmarks, achieving gains of up to 16% on MathVision using only 400 data samples, and generalizable improvements across different model sizes and architectures. MindGYM underscores the viability of self-challenging mechanisms in refining large model capabilities while minimizing human intervention and resource demands. Code and data are released to promote data-centric research into self-evolving foundation models driven by their internal reasoning capabilities.

摘要

大型基础模型在获取可迁移的结构化思维能力方面面临挑战，尤其是在采用固定模板或众包标注指令数据集进行监督训练时。与现有方法不同，我们提出以思维为核心的数据合成范式，使模型能够通过自我生成、认知引导的数据实现进化。我们设计MindGYM这一结构化、可扩展的问题合成框架，包含三个核心组件：（1）认知思维过程注入——通过高层级推理目标塑造模型的合成行为；（2）种子单跳问题合成——从多样化语义类型生成原子问题以拓宽思维广度；（3）挑战性多跳问答合成——基于问答种子构建更复杂的多跳问题以实现深度推理。详细分析表明，本方法生成的合成数据相比基线源数据平均质量提升16.7%，质量方差降低67.91%，证实高质量且自包含的数据对思维导向的微调至关重要。MindGYM在六大推理基准测试中取得性能提升，仅用400个数据样本即在MathVision上实现最高16%的增益，且改进效果在不同模型规模和架构中均具普适性。该框架验证了通过自我挑战机制精进大模型能力的可行性，同时显著降低人工干预和资源需求。我们公开代码和数据以推动基于内部推理能力的自进化基础模型研究。

Robust and Fine-Grained Detection of AI Generated Texts

Abstract

arXiv:2504.11952v2 Announce Type: replace-cross Abstract: An ideal detection system for machine-generated content should work well on any generator, as more advanced LLMs come into existence day by day. Existing systems often struggle to accurately identify AI-generated content in shorter texts. Further, not all texts are entirely authored by a human or an LLM, so we focus on partial cases, i.e., human-LLM co-authored texts. Our paper introduces a set of models built for the task of token classification and trained on an extensive collection of human-machine co-authored texts; they perform well on texts from unseen domains and unseen generators, texts by non-native speakers, and texts with adversarial inputs. We also introduce a new dataset of over 2.4M such texts, mostly co-authored by several popular proprietary LLMs, covering 23 languages. We also present our models' performance on the texts of each domain and generator. Additional findings include comparisons of performance across adversarial methods and input text lengths, and the characteristics of generated texts compared to the original human-authored texts.

摘要

随着日益先进的大型语言模型不断涌现，理想的机器生成内容检测系统应能在任何生成器上均表现良好。现有系统往往难以准确识别短文本中的AI生成内容。此外，并非所有文本都完全由人类或LLM创作，因此我们更关注部分创作场景（即人机协作文本）。本文提出一组专用于标记分类任务的模型，这些模型在大量人机协作文本数据集上进行训练，在未见领域文本、未知生成器文本、非母语者文本及对抗性输入文本上均表现优异。我们同时发布了一个包含240万条文本的新数据集，主要由多种流行商业LLM以23种语言协作创作完成。研究还展示了模型在各领域和各生成器文本上的性能表现，并额外分析了对抗方法效果对比、输入文本长度影响，以及生成文本相较于原始人类创作文本的特征差异。

ASMA-Tune: Unlocking LLMs' Assembly Code Comprehension via Structural-Semantic Instruction Tuning

Abstract

arXiv:2503.11617v2 Announce Type: replace-cross Abstract: Assembly code analysis and comprehension play critical roles in applications like reverse engineering, yet they face substantial challenges due to low information density and a lack of explicit syntactic structures. While traditional masked language modeling (MLM) approaches do not explicitly focus on natural language interaction, emerging decoder-focused large language models (LLMs) demonstrate partial success in binary analysis yet remain underexplored for holistic comprehension. We present Assembly Augmented Tuning, an end-to-end structural-semantic instruction tuning framework that synergizes encoder architecture with decoder-based LLMs through a projector module, where the assembly encoder extracts hardware-level structural features, the projector bridges representations with the semantic space, and the instruction-tuned LLM preserves natural language capabilities. Experimental results demonstrate three key advantages: (1) State-of-the-art performance in assembly comprehension with +39.7% Recall@1 and +17.8% MRR improvements over GPT-4-Turbo, (2) Consistent enhancements across base models (24.6-107.4% Recall@1 and 15.2-106.3% MRR on Qwen2.5-Coder, Deepseek-Coder and CodeLlama variants), and (3) Superior instruction-following capabilities (41.5%-118% improvements) with controlled code generation degradation (-8.9% to -35% across architectures).

摘要

汇编代码分析与理解在逆向工程等应用中具有关键作用，但由于信息密度低且缺乏显式句法结构，这些任务面临重大挑战。传统掩码语言建模（MLM）方法未明确关注自然语言交互，而新兴的以解码器为核心的大语言模型（LLMs）在二进制分析中虽取得部分成功，但对整体理解的研究仍不充分。我们提出"汇编增强调优"——一种端到端的结构-语义指令调优框架，通过投影模块将编码器架构与基于解码器的LLMs协同整合：汇编编码器提取硬件级结构特征，投影模块实现表征与语义空间的桥接，指令调优后的LLM则保留自然语言处理能力。实验结果表明该框架具有三大优势：（1）汇编理解性能达到最先进水平，Recall@1指标较GPT-4-Turbo提升39.7%，MRR提升17.8%；（2）在不同基础模型（Qwen2.5-Coder、Deepseek-Coder及CodeLlama变体）上均实现持续增强，Recall@1提升24.6-107.4%，MRR提升15.2-106.3%；（3）具备卓越的指令跟随能力（提升41.5%-118%），且代码生成性能受控下降（不同架构下降8.9%至35%）。

Improving Multilingual Capabilities with Cultural and Local Knowledge in Large Language Models While Enhancing Native Performance

Abstract

arXiv:2504.09753v2 Announce Type: replace-cross Abstract: Large Language Models (LLMs) have shown remarkable capabilities, but their development has primarily focused on English and other high-resource languages, leaving many languages underserved. We present our latest Hindi-English bilingual LLM, Mantra-14B, with ~3% average improvement in benchmark scores over both languages, outperforming models twice its size. Using a curated dataset of 485K English and Hindi instruction samples, we instruction-tuned models such as Qwen-2.5-14B-Instruct and Phi-4 to improve performance in both English and Hindi. Our experiments, encompassing seven different LLMs of varying parameter sizes and over 140 training attempts with varying English-Hindi training data ratios, demonstrated that it is possible to significantly improve multilingual performance without compromising native performance. Further, our approach avoids resource-intensive techniques like vocabulary expansion or architectural modifications, thus keeping the model size small. Our results indicate that modest fine-tuning with culturally and locally informed data can bridge performance gaps without incurring significant computational overhead. We release our training code, datasets, and models under MIT and Apache licenses to aid further research on under-represented and low-resource languages.

摘要

大型语言模型（LLMs）已展现出卓越能力，但其发展主要集中于英语等高资源语言，导致许多语言服务不足。我们推出最新的印地语-英语双语LLM Mantra-14B，其在两种语言的基准测试中平均提升约3%，性能超越参数规模两倍的模型。通过使用包含48.5万条英印双语指令数据的精选数据集，我们对Qwen-2.5-14B-Instruct和Phi-4等模型进行指令微调，显著提升了两种语言的性能。实验涵盖七种不同参数规模的LLM及140余次训练尝试（采用不同英印数据配比），证明可在保持原生语言性能的同时显著提升多语言能力。该方法避免词汇扩展或架构修改等资源密集型技术，从而保持较小模型规模。结果表明，采用具有文化及地域特色的数据进行适度微调，可在不显著增加计算开销的情况下弥合性能差距。我们将训练代码、数据集及模型以MIT和Apache协议开源，以促进对低资源及代表性不足语言的进一步研究。

